# Install the ASR workshop files (CIM)
Rolando Coto-Solano (Rolando.A.Coto.Solano@dartmouth.edu)<br>
Dartmouth College. Last update: 20250802

The program takes two main inputs:

* `folderToCreate`: This folder will be created in your Google Drive. It will contain the files and folders necessary for ASR training. The code below sets it to `202508-cim-asr`.<br>
* `sandboxes`: An array with the names of the sandboxes that will be used. It requires at least two: `sandbox-user` as a temporary bucket, and `all-wavs` as a permanent one. You can add more sandboxes if you have more than one person working on your transcriptions.<br>
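
The requirement that `sandboxes` contain both `sandbox-user` and `all-wavs` can be checked before anything is created in Google Drive. The following is a minimal sketch, not part of the original installer: the helper `missing_sandboxes` and the extra sandbox name `sandbox-maria` are hypothetical and only illustrate the check.

```python
# The two sandbox names the installer requires.
REQUIRED_SANDBOXES = ("sandbox-user", "all-wavs")

def missing_sandboxes(sandboxes):
    """Return the required sandbox names that are absent from `sandboxes`."""
    return [name for name in REQUIRED_SANDBOXES if name not in sandboxes]

# "sandbox-maria" is an illustrative extra sandbox for a second transcriber.
sandboxes = ["sandbox-user", "all-wavs", "sandbox-maria"]

missing = missing_sandboxes(sandboxes)
if missing:
    raise ValueError(f"Missing required sandboxes: {missing}")
```

Running a check like this first avoids ending up with a partially created folder tree when a required name was mistyped.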

The program will perform the following tasks:

1. Ask for permission to read and write to your Google Drive
2. Create folders and subfolders for each sandbox
3. Download the exercise files
4. Download the code for the transcription training
5. Create the Google Sheets for the transcriptions

## (1) Settings needed for the installation

In [None]:
folderToCreate = "202508-cim-asr"
sandboxes = ["sandbox-user", "all-wavs"]

downloadCIMAudioFiles = 1   # Set this to 0 if you don't want to download the audio files automatically

## (2) Request access to your Google Drive

In [None]:
# Load other libraries
import pandas as pd
import random
import os.path

# Load libraries for access to Google Spreadsheets
from google.colab import auth
auth.authenticate_user()
import gspread
from google.colab import drive

# It needs this permission to access the ASR spreadsheets in your GDrive
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)

In [None]:
# It needs this permission to read and write ASR files into your GDrive
drive.mount('/content/drive/',force_remount=True)

Mounted at /content/drive/


## (3) Create folders and download exercise files

In [None]:
# Example of the folder structure:

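# -- 202508-cim-asr
#    | - all-wavs
#    |   | - audiofiles-to-transcribe
#    |   | - logs-wav2vec2-res
#    |   | - logs-whisper-res
#    |   | - processed-elan-files
#    |   | - tsv-inputs
#    |   | - tsv-outputs
#    |   | - wav
#    |   | - wav2vec2-model
#    |   | - whisper-model
#    | - sandbox-user
#    |   | - audiofiles-to-transcribe
#    |   | - logs-wav2vec2-res
#    |   | - logs-whisper-res
#    |   | - processed-elan-files
#    |   | - tsv-inputs
#    |   | - tsv-outputs
#    |   | - wav
#    |   | - wav2vec2-model
#    |   | - whisper-model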

In [None]:
#===========================================================================
# Create folders
#===========================================================================

# The -p flag keeps mkdir from failing if a folder already exists,
# so this cell can be run again safely.
!mkdir -p /content/drive/MyDrive/{folderToCreate}

for b in sandboxes:
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/audiofiles-to-transcribe
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/logs-wav2vec2-res
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/logs-whisper-res
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/processed-elan-files
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/tsv-inputs
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/tsv-outputs
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/wav
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/wav2vec2-model
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/whisper-model
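
The same folder layout could also be created from plain Python instead of shell commands. This is a sketch under assumptions: the function `create_sandbox_folders` and the constant `SUBFOLDERS` are hypothetical names, and the demonstration writes to a temporary directory rather than to `/content/drive/MyDrive`, which only exists after the Drive has been mounted in Colab.

```python
import tempfile
from pathlib import Path

# Subfolder names taken from the mkdir commands in the cell above.
SUBFOLDERS = [
    "audiofiles-to-transcribe", "logs-wav2vec2-res", "logs-whisper-res",
    "processed-elan-files", "tsv-inputs", "tsv-outputs",
    "wav", "wav2vec2-model", "whisper-model",
]

def create_sandbox_folders(root, folder_to_create, sandboxes):
    """Create root/folder_to_create/<sandbox>/<subfolder> for every sandbox.

    Folders that already exist are left alone, so the call can be repeated.
    """
    created = []
    for sandbox in sandboxes:
        for sub in SUBFOLDERS:
            path = Path(root) / folder_to_create / sandbox / sub
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created

# Demonstration in a temporary directory (in Colab, the root would be the mounted Drive).
with tempfile.TemporaryDirectory() as tmp:
    paths = create_sandbox_folders(tmp, "202508-cim-asr", ["sandbox-user", "all-wavs"])
    for p in paths[:3]:
        print(p.relative_to(tmp))
```

Keeping the subfolder names in one list means a new subfolder only has to be added in one place.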

In [None]:
#===========================================================================
# Download workshop files
#===========================================================================

for b in sandboxes:
  if (downloadCIMAudioFiles == 1 and "sandbox-user" in b):

    !curl -o /content/drive/MyDrive/{folderToCreate}/{b}/audiofiles-to-transcribe/kia-orana-rehearsal.mp4 https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506/kia-orana-rehearsal.mp4
    !curl -o /content/drive/MyDrive/{folderToCreate}/{b}/processed-elan-files/RRBPKB1.wav https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506/RRBPKB1.wav
    !curl -o /content/drive/MyDrive/{folderToCreate}/{b}/processed-elan-files/RRMSAvaikiP24.wav https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506/RRMSAvaikiP24.wav

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6066k  100 6066k    0     0  4946k      0  0:00:01  0:00:01 --:--:-- 4947k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  481M  100  481M    0     0  35.6M      0  0:00:13  0:00:13 --:--:-- 39.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  321M  100  321M    0     0  33.9M      0  0:00:09  0:00:09 --:--:-- 37.6M
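
The two `.wav` files above are several hundred megabytes each, so it can be useful to skip files that are already in the Drive when the cell is run again. The sketch below is one possible way to do that and is not part of the original installer: `pending_downloads` and `download_pending` are hypothetical helpers. Only the planning step is demonstrated here; `download_pending` needs network access and is not called.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506"

# (subfolder, filename) pairs taken from the curl commands above.
WORKSHOP_FILES = [
    ("audiofiles-to-transcribe", "kia-orana-rehearsal.mp4"),
    ("processed-elan-files", "RRBPKB1.wav"),
    ("processed-elan-files", "RRMSAvaikiP24.wav"),
]

def pending_downloads(sandbox_dir):
    """Return (url, destination) pairs for files not yet present in sandbox_dir."""
    pending = []
    for subfolder, filename in WORKSHOP_FILES:
        destination = Path(sandbox_dir) / subfolder / filename
        if not destination.exists():
            pending.append((f"{BASE_URL}/{filename}", destination))
    return pending

def download_pending(sandbox_dir):
    """Fetch every missing file (requires network access)."""
    for url, destination in pending_downloads(sandbox_dir):
        destination.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(destination))

# Offline demonstration: list what would be fetched for an empty sandbox folder.
with tempfile.TemporaryDirectory() as tmp:
    for url, destination in pending_downloads(tmp):
        print(url, "->", destination.relative_to(tmp))
```

Note that this check only looks at whether a file exists. An interrupted download would leave a partial file that is then skipped, so such a file would need to be deleted by hand before retrying.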


## (4) Download Jupyter notebooks for the exercises

In [None]:
!curl -o /content/drive/MyDrive/{folderToCreate}/from-elan-to-wav-and-gsheet.ipynb https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506/from-elan-to-wav-and-gsheet.ipynb
!curl -o /content/drive/MyDrive/{folderToCreate}/from-gsheet-to-wav2vec2-files.ipynb https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506/from-gsheet-to-wav2vec2-files.ipynb

!curl -o /content/drive/MyDrive/{folderToCreate}/train-w2v2-lm-conda.ipynb https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506/train-w2v2-lm-conda.ipynb
!curl -o /content/drive/MyDrive/{folderToCreate}/train-wav2vec2lm-miniconda-202505.py https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506/train-wav2vec2lm-miniconda-202505.py

!curl -o /content/drive/MyDrive/{folderToCreate}/inference-transcribe-w2v2-longRecording.ipynb https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506/inference-transcribe-w2v2-longRecording.ipynb
!curl -o /content/drive/MyDrive/{folderToCreate}/inference-transcribe-w2v2-from-user.ipynb https://rcweb.dartmouth.edu/RCoto/tocc-asr-workshop-202506/inference-transcribe-w2v2-from-user.ipynb

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14063  100 14063    0     0  29927      0 --:--:-- --:--:-- --:--:-- 29985
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19748  100 19748    0     0  42678      0 --:--:-- --:--:-- --:--:-- 42652
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13975  100 13975    0     0  29814      0 --:--:-- --:--:-- --:--:-- 29861
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19531  100 19531    0     0  31274      0 --:--:-- --:--:-- --:--:-- 31249

## (5) Create the Google Sheets files

In [None]:
#=================================================================
# Write Google Sheet files
#=================================================================

# Header row for the 'wav-metadata' worksheet
inValues = ["wav_filename", "dataProcessedBy", "dateAdded", "speakerPrefix", "gender", "wav_filesize", "duration_seconds", "codeSwitch", "needsFurtherCheck", "transcript", "original_transcript"]

# IDs of the created spreadsheets, in the same order as `sandboxes`
idSandboxes = []

for b in sandboxes:
  sheetName = "asr-transcriptions-" + b
  sh = gc.create(sheetName)
  # Use the spreadsheet that was just created instead of looking it up by name,
  # which could return an older spreadsheet with the same name
  worksheet = sh.sheet1
  worksheet.update_title('wav-metadata')
  worksheet.append_row(inValues)
  print(str(b) + "\t" + str(sh.id))
  idSandboxes.append(sh.id)

sandbox-user	16mLHkoWBj_uB7ISWOtmiuLj0UXGgh-M94u-qrYUNR7Q
all-wavs	1fsjpsrUEuD9PrMZhVOYwAzqGaveEFtpZOuYC5cUP9nk
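
Each ID printed above identifies one spreadsheet. Keeping the IDs in a dictionary keyed by sandbox name makes the pairing explicit. This is a small sketch: the IDs are copied from the sample output above and will differ on every installation, and `sheet_url` is a hypothetical helper that builds the usual Google Sheets address from an ID.

```python
# IDs copied from the sample output above; yours will be different.
sheet_ids = {
    "sandbox-user": "16mLHkoWBj_uB7ISWOtmiuLj0UXGgh-M94u-qrYUNR7Q",
    "all-wavs": "1fsjpsrUEuD9PrMZhVOYwAzqGaveEFtpZOuYC5cUP9nk",
}

def sheet_url(sheet_id):
    """Build the browser address of a Google Sheets file from its ID."""
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}"

for sandbox, sheet_id in sheet_ids.items():
    print(f"{sandbox}\t{sheet_url(sheet_id)}")
```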


In [None]:
#=================================================================
# Move the Google Sheet files to the installation folder
# (If you get an error that you "cannot stat" the file,
# give it a minute or two and try again. The drive might
# take a minute to update itself).
#=================================================================

drive.mount('/content/drive/',force_remount=True)

for b in sandboxes:
  sheetName = "asr-transcriptions-" + b
  !mv /content/drive/MyDrive/{sheetName}.gsheet /content/drive/MyDrive/{folderToCreate}

Mounted at /content/drive/
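
The "cannot stat" situation described in the comment above could be handled by waiting for the file to appear before moving it. The sketch below shows the idea with a hypothetical helper, `move_when_available`, and is demonstrated on a temporary directory. It is an assumption that polling is enough on Colab; the mounted Drive may need to be remounted between attempts before a new `.gsheet` file becomes visible.

```python
import shutil
import tempfile
import time
from pathlib import Path

def move_when_available(source, target_dir, attempts=6, delay_seconds=20):
    """Move `source` into `target_dir`, polling until `source` exists.

    Returns True if the file was moved, False if it never appeared.
    """
    source = Path(source)
    for attempt in range(attempts):
        if source.exists():
            shutil.move(str(source), str(target_dir))
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # give the drive time to update
    return False

# Demonstration with a stand-in file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sheet_file = Path(tmp) / "asr-transcriptions-sandbox-user.gsheet"
    sheet_file.write_text("placeholder")
    install_dir = Path(tmp) / "202508-cim-asr"
    install_dir.mkdir()
    moved = move_when_available(sheet_file, install_dir, attempts=1, delay_seconds=0)
    print(moved, (install_dir / sheet_file.name).exists())
```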
