<a href="https://colab.research.google.com/github/rgmartin/1-logistic_regression_classifier/blob/main/execution_framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install torchmetrics > /dev/null
!pip install pytorch-lightning > /dev/null
!pip install librosa > /dev/null
!pip install optuna > /dev/null

# Overall Script Description
Run the cells in order. The notebook mounts Google Drive, transfers and unzips the data onto the Colab instance, and then trains the model, printing output that indicates whether everything is working correctly. New files should also appear on Google Drive under "Measurements", each carrying a date and time stamp that corresponds to the output of the training process. The time stamps are generated by the Google Colab instance, so they may not match your local time, depending on where the instance is located.

The cells are split up by functionality to make it easier to debug, profile, and troubleshoot any issues that come up.
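
Since the time stamps come from the clock of the Colab instance, one way to keep them unambiguous is to generate them in UTC. The following is a minimal sketch, not the project's actual logging code: the helper name `measurement_filename`, the prefix, and the file-name pattern are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def measurement_filename(prefix="measurements", now=None):
    """Build a file name stamped with a UTC date and time.

    Hypothetical helper: the prefix and pattern are illustrative only.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Convert to UTC so the stamp does not depend on the machine's location.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    return f"{prefix}_{stamp}_UTC.json"


# Example: 13:03:45 in a UTC-4 zone is 17:03:45 UTC.
local = datetime(2022, 3, 23, 13, 3, 45, tzinfo=timezone(timedelta(hours=-4)))
example = measurement_filename(now=local)
print(example)  # measurements_2022-03-23_17-03-45_UTC.json
```

Putting the zone in the file name makes it easy to convert a stamp back to local time when comparing runs.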

# Configure Directories/Paths/Languages
The following cells configure which languages to work with (currently three are available on Google Drive), where Google Drive is mounted, and where the data is extracted.


In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os
languages = ["EN", "DE", "ES"]
mount_point = '/content/drive'
load_path = os.path.join(mount_point, 'MyDrive/ECSE-552-FP/Data/')
unzip_path = '/content/speech_data'
save_path = '/MyDrive/ECSE-552-FP/Measurements'

from google.colab import drive
drive.mount(mount_point, force_remount=True)

Mounted at /content/drive
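
When combining these paths, it helps to remember how `os.path.join` treats a component that starts with a slash. The sketch below is illustrative only and uses `posixpath`, which is what `os.path` resolves to on a Linux host such as Colab, so that the result is the same on any platform. A relative component is appended to the mount point, while an absolute one replaces it.

```python
import posixpath

mount_point = "/content/drive"

# A relative second component is appended to the mount point.
relative = posixpath.join(mount_point, "MyDrive/ECSE-552-FP/Data/")
print(relative)  # /content/drive/MyDrive/ECSE-552-FP/Data/

# An absolute second component (leading slash) discards the mount point.
absolute = posixpath.join(mount_point, "/MyDrive/ECSE-552-FP/Measurements")
print(absolute)  # /MyDrive/ECSE-552-FP/Measurements
```

A path meant to live on the mounted drive should therefore be joined without a leading slash, or concatenated explicitly with the mount point.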


# Download and extract the dataset zip files
The following cell extracts the language zip files from Google Drive onto the Colab instance. It is done in its own cell to make it easier to profile and debug issues with the network connection between Google Drive and Google Colab.

This step is separate from dataset creation for two reasons. First, if the data were read through the mounted network drive, the network accesses would be hidden inside dataset creation, making bottlenecks there much harder to debug. Second, all of the data must be transferred at some point during an epoch anyway, so it is easier to do it all up front and make sure the data is on the Colab instance before training. Keeping as much as possible local to the Colab instance during training makes things easier to debug as well as more efficient.

A commented-out line lets you select between the "debug" dataset and the "full" dataset for the languages. The "debug" set is a small subset (20 samples for each language) that helps with debugging the data flow in the models. It removes the need to run through the entire dataset and helps ensure all the pipes are connected correctly.

In [4]:
for language in languages:
    language_dir_path = os.path.join(unzip_path, language)
    os.makedirs(language_dir_path, exist_ok=True)
    # Use the small debug archive; swap the two lines below for the full dataset.
    archive = language + "_debug_set" + ".zip"
    #archive = language + ".zip"
    !unzip -n {os.path.join(load_path, archive)} -d {language_dir_path} > /dev/null
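
The same step can be sketched in pure Python with the standard `zipfile` module, which makes the debug/full choice an explicit argument. This is an illustrative alternative, not the notebook's own method: the function name `extract_language_archive` and the `use_debug_set` flag are assumptions, and the skip-if-present check only approximates the behaviour of `unzip -n`.

```python
import os
import zipfile


def extract_language_archive(load_path, unzip_path, language, use_debug_set=True):
    """Extract one language archive, skipping files that already exist."""
    suffix = "_debug_set" if use_debug_set else ""
    archive_path = os.path.join(load_path, f"{language}{suffix}.zip")
    language_dir_path = os.path.join(unzip_path, language)
    os.makedirs(language_dir_path, exist_ok=True)

    extracted = []
    with zipfile.ZipFile(archive_path) as zf:
        for member in zf.namelist():
            target = os.path.join(language_dir_path, member)
            # Like `unzip -n`: never overwrite something that is already there.
            if os.path.exists(target):
                continue
            zf.extract(member, language_dir_path)
            extracted.append(member)
    return extracted
```

Calling `extract_language_archive(load_path, unzip_path, "EN")` would extract `EN_debug_set.zip`, and passing `use_debug_set=False` would select `EN.zip` instead. The returned list shows which members were newly written, which can help when checking that a transfer completed.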

# Download the source code from GitHub

In [5]:
user = "dgsmith1988"
repo = "ECSE-552-Final-Project"
src_dir = "Code"
pyfiles = ["models.py", "train.py", "dict_logger.py", "feature_extraction.py", "move_checkpoint.py"]

for pyfile in pyfiles:
    # Remove any stale copy first so wget does not save a numbered duplicate.
    !rm -f {pyfile}
    url = f"https://raw.githubusercontent.com/{user}/{repo}/rubert/{src_dir}/{pyfile}"
    !wget {url} > /dev/null


--2022-03-23 17:03:45--  https://raw.githubusercontent.com/dgsmith1988/ECSE-552-Final-Project/rubert/Code/models.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5628 (5.5K) [text/plain]
Saving to: ‘models.py’


2022-03-23 17:03:45 (56.6 MB/s) - ‘models.py’ saved [5628/5628]

--2022-03-23 17:03:45--  https://raw.githubusercontent.com/dgsmith1988/ECSE-552-Final-Project/rubert/Code/train.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10023 (9.8K) [text/plain]
Saving to: ‘train.py’


2022-03-23 17:03:45 (77.7 MB/s) - ‘train.py’
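
The download URLs follow the raw-file pattern visible in the log: user, repository, branch, source directory, file name. A small sketch of that construction is shown here. The helper name `raw_github_url` is an assumption introduced for illustration, and making the branch a parameter is a design choice of this sketch (the cell hard-codes `rubert`).

```python
def raw_github_url(user, repo, branch, src_dir, pyfile):
    """Return the raw.githubusercontent.com URL for one source file."""
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{src_dir}/{pyfile}"


urls = [
    raw_github_url("dgsmith1988", "ECSE-552-Final-Project", "rubert", "Code", name)
    for name in ["models.py", "train.py"]
]
print(urls[0])
# https://raw.githubusercontent.com/dgsmith1988/ECSE-552-Final-Project/rubert/Code/models.py
```

Keeping the branch as an argument makes it straightforward to point the notebook at a different branch without editing the URL template.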


# Hyperparameter tuning

In [6]:
!rm -r ./Checkpoints

rm: cannot remove './Checkpoints': No such file or directory


In [7]:
import optuna
import train
from models import BaselineResnetClassifier

model = BaselineResnetClassifier(num_classes=3)
best_checkpoint_path, best_params_dict, study = train.hp_tuning_voxforge_classifier(model, data_dir=unzip_path)

Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth


  0%|          | 0.00/97.8M [00:00<?, ?B/s]

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
[I 2022-03-23 17:04:06,837] A new study created in memory with name: no-name-288e65e3-9660-48ef-9619-f6ef7c3bab68


Preparing and splitting dataset...


3it [00:00,  9.86it/s]
3it [00:00,  9.96it/s]
3it [00:00,  9.32it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Dataset creation in seconds:  0.962279250999984



  | Name           | Type     | Params
--------------------------------------------
0 | resnet50       | ResNet   | 25.6 M
1 | fc             | Linear   | 3.0 K 
2 | train_accuracy | Accuracy | 0     
3 | test_accuracy  | Accuracy | 0     
4 | val_accuracy   | Accuracy | 0     
--------------------------------------------
25.6 M    Trainable params
0         Non-trainable params
25.6 M    Total params
102.240   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

[I 2022-03-23 17:04:22,323] Trial 0 finished with value: 0.25 and parameters: {'max_t': 3, 'batch_size': 12}. Best is trial 0 with value: 0.25.


Preparing and splitting dataset...


3it [00:00,  9.14it/s]
3it [00:00,  8.15it/s]
3it [00:00,  7.13it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type     | Params
--------------------------------------------
0 | resnet50       | ResNet   | 25.6 M
1 | fc             | Linear   | 3.0 K 
2 | train_accuracy | Accuracy | 0     
3 | test_accuracy  | Accuracy | 0     
4 | val_accuracy   | Accuracy | 0     
--------------------------------------------
25.6 M    Trainable params
0         Non-trainable params
25.6 M    Total params
102.240   Total estimated model params size (MB)


Dataset creation in seconds:  1.1531232199999977


  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")


Validation sanity check: 0it [00:00, ?it/s]

  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

[I 2022-03-23 17:04:25,099] Trial 1 finished with value: 0.5833333134651184 and parameters: {'max_t': 3, 'batch_size': 17}. Best is trial 1 with value: 0.5833333134651184.


Preparing and splitting dataset...


3it [00:00, 13.64it/s]
3it [00:00, 14.78it/s]
3it [00:00, 13.59it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type     | Params
--------------------------------------------
0 | resnet50       | ResNet   | 25.6 M
1 | fc             | Linear   | 3.0 K 
2 | train_accuracy | Accuracy | 0     
3 | test_accuracy  | Accuracy | 0     
4 | val_accuracy   | Accuracy | 0     
--------------------------------------------
25.6 M    Trainable params
0         Non-trainable params
25.6 M    Total params
102.240   Total estimated model params size (MB)


Dataset creation in seconds:  0.6757145649999927


  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")


Validation sanity check: 0it [00:00, ?it/s]

  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

[I 2022-03-23 17:04:26,846] Trial 2 finished with value: 0.4166666567325592 and parameters: {'max_t': 1, 'batch_size': 15}. Best is trial 1 with value: 0.5833333134651184.


Number of finished trials: 3
Best trial:
  Value: 0.5833333134651184
  Params: 
    max_t: 3
    batch_size: 17


In [10]:
from optuna.visualization import *

In [None]:
# Next, move the desired checkpoint file and assign it a new, meaningful name.
# It will be stored in the Checkpoints folder.
from move_checkpoint import move_checkpoint
move_checkpoint(best_checkpoint_path, best_params_dict, study, 'best_check_point')

Running on Colab


# Run/Train the Model

In [None]:
import train
from models import BaselineResnetClassifier

model = BaselineResnetClassifier(num_classes=3)
train.train_voxforge_classifier(model, data_dir=unzip_path)