# Computational modeling of infants’ early phonetic learning across languages

This is the workspace for Haozhe Sun's internship project, carried out from 2022-06-06 to 2022-07-31 in the QARMA research group of the LIS laboratory in Marseille, France.

In [None]:
!pip install pytube
!pip install pydub
!pip install googletrans==3.1.0a0
!pip install datasets
!pip install ujson

## 1. Searching via YouTube Data API

We scrape stories for children on YouTube in different languages to get enough training data.

Inspired by https://github.com/youtube/api-samples/blob/master/python/search.py

API: https://cloud.google.com/console

Relevant API doc: https://developers.google.com/youtube/v3/docs/search/list
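
The `search.list` endpoint returns results one page at a time, and each response carries a `nextPageToken` as long as more pages exist. The sketch below shows one way to follow those tokens. The helper `collect_pages` and the stand-in `fake_fetch` are hypothetical names used for illustration only; no network call is made.

```python
from typing import Callable, Dict, List, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], Dict], max_pages: int) -> List[Dict]:
    """Follow nextPageToken links for at most max_pages pages."""
    pages: List[Dict] = []
    token: Optional[str] = None
    for _ in range(max_pages):
        page = fetch_page(token)
        pages.append(page)
        token = page.get("nextPageToken")
        if token is None:  # last page reached: nothing more to request
            break
    return pages


# Stand-in for youtube.search().list(...).execute(): two fake pages.
FAKE_PAGES = {
    None: {"items": [{"id": {"videoId": "a"}}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": {"videoId": "b"}}]},
}


def fake_fetch(token):
    return FAKE_PAGES[token]


pages = collect_pages(fake_fetch, max_pages=3)
video_ids = [item["id"]["videoId"] for page in pages for item in page["items"]]
print(video_ids)
```

Stopping as soon as the token is missing avoids spending quota on requests that cannot succeed.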

**WARNING: the developer key should never be pushed to GitHub.**
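
One way to keep the key out of the notebook is to read it from an environment variable. This is a minimal sketch, not part of the original workflow: the variable name `YOUTUBE_API_KEY` and the helper `load_developer_key` are assumptions.

```python
import os


def load_developer_key(env=os.environ, var_name="YOUTUBE_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before calling the API.")
    return key


# Example with an explicit mapping instead of the real environment:
print(load_developer_key({"YOUTUBE_API_KEY": "dummy-key"}))
```

The returned value could then be assigned to `DEVELOPER_KEY` inside `get_api()`, so that no secret is ever written into a cell.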

In [None]:
from googleapiclient.discovery import build

def get_api():
  DEVELOPER_KEY = ''
  YOUTUBE_API_SERVICE_NAME = 'youtube'
  YOUTUBE_API_VERSION = 'v3'
  youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                  developerKey=DEVELOPER_KEY)
  return youtube

def search(youtube, **kwargs):
  search_response = youtube.search().list(**kwargs).execute()
  return search_response

In [None]:
import ujson as json
from googleapiclient.errors import HttpError
from googletrans import Translator

# The daily quota is 100 search pages, i.e. ~3 pages per language
# = ~150 items per language = ~10-50 hours of audio per language before filtering.
nb_pages_per_lang = 3

maxResults = 50  # max allowed by page
part = 'id, snippet'
item_type = 'video'
duration = "medium"

translator = Translator()

# ISO-639-1 code, see http://www.loc.gov/standards/iso639-2/php/code_list.php
lang_ids = {
    "afrikaans": "af", 
    "arabic": "ar", 
    "chinese": "zh-CN", # googletrans' convention for detecting language
    "croatian": "hr", 
    "czech": "cs", 
    "german": "de", 
    "estonian": "et", 
    "english": "en", 
    "spanish": "es", 
    "french": "fr", 
    "greek": "el", 
    "hebrew": "iw", # the new standard is "he", acceptable input, but not the output convention of googletrans
    "korean": "ko", 
    "hungarian": "hu", 
    "icelandic": "is",
    "indonesian": "id", 
    "hindi": "hi", 
    "italian": "it", 
    "japanese": "ja", 
    "lithuanian": "lt", 
    "malay": "ms", 
    "dutch": "nl", 
    "maori": "mi", 
    "polish": "pl", 
    "portuguese": "pt", 
    "romanian": "ro", 
    "swedish": "sv", 
    "telugu": "te", 
    "turkish": "tr"
}
          
# Language names are in English to avoid Unicode complications.
# These are the available MBROLA voices, minus Iranian (it doesn't have an ISO-639-1 code), Breton (not supported by googletrans), and Latin (no query results on YouTube).
# 29 languages in total.
# Dialects are not distinguished, as they don't matter in a grapheme-based video query.
langs = ["afrikaans", "arabic", "chinese", "croatian", "czech", "german", "estonian", "english", "spanish", "french", "greek", "hebrew", "korean", "hungarian", "icelandic",
    "indonesian", "hindi", "italian", "japanese", "lithuanian", "malay", "dutch", "maori", "polish", "portuguese", "romanian", "swedish", "telugu", "turkish"]
lang_qs = {lang: translator.translate('stories for children', dest=lang_ids[lang]).text for lang in langs}

youtube = get_api()
res = {lang: [] for lang in langs}

for lang in langs:
  for page in range(nb_pages_per_lang):
    try:
      if page == 0:
        res[lang].append(search(youtube, maxResults=maxResults, part=part, type=item_type, q=lang_qs[lang], videoDuration=duration))
      else:
        # the last page has no 'nextPageToken': stop paging for this language
        pageToken = res[lang][-1].get('nextPageToken')
        if pageToken is None:
          break
        res[lang].append(search(youtube, pageToken=pageToken, maxResults=maxResults, part=part, type=item_type, q=lang_qs[lang], videoDuration=duration))
    except HttpError as e:
      print(f'An HTTP error occurred for {lang}: {e.resp.status}: {e.content}')
      break  # later pages need this page's token, so skip the rest of this language

with open("results.json", 'w') as f:
  json.dump(res, f)

## 2. Constructing training dataset

In [None]:
# from results.json to per-language videos_{lang}.json files
# we further filter search results by inspecting the language of title and description
from tqdm import tqdm
import os

with open("results.json", 'r') as f:
  res = json.load(f)

if not os.path.exists("./videos"):
  os.mkdir("./videos")

translator = Translator()

for lang in tqdm(langs):
  video_dict = {lang: []}
  for page in res[lang]:  # named `page` so that it does not shadow the search() function
    for item in page["items"]:
      if translator.detect(item["snippet"]["title"]).lang == lang_ids[lang] and translator.detect(item["snippet"]["description"]).lang == lang_ids[lang]:
        video_dict[lang].append(item["id"]["videoId"])

  with open(f"./videos/videos_{lang}.json", 'w') as g: # on Colab, download and keep these files (or download results.json instead)
    json.dump(video_dict, g)

We build the dataset by feature extraction: librosa computes mel spectrograms, which serve as input features.

In the first part of this project we ignore all ordering and positional information, both within an audio file and across audio files. The input to the model is therefore the set of feature vectors of all windows, treated as a single bag of vectors. We use HuggingFace datasets and its batched mapping for this step.
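
The bag-of-vectors idea can be illustrated with NumPy alone. In this sketch, random arrays stand in for mel spectrograms of shape `(n_mels, n_frames)`; the helper name `bag_of_frames` is hypothetical.

```python
import numpy as np


def bag_of_frames(spectrograms):
    """Stack (n_mels, n_frames_i) arrays into one (sum_i n_frames_i, n_mels) matrix."""
    return np.concatenate([s.T for s in spectrograms], axis=0)


rng = np.random.default_rng(0)
# Stand-ins for mel spectrograms of three clips with different durations.
specs = [rng.random((128, n_frames)) for n_frames in (10, 25, 7)]
frames = bag_of_frames(specs)
print(frames.shape)  # (42, 128)

# Shuffling the rows gives an equally valid training set,
# because frame order is deliberately ignored at this stage.
shuffled = frames[rng.permutation(len(frames))]
```

Each row is one 128-dimensional frame, which matches the input size of the models defined later in the notebook.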

In [None]:
from pytube import YouTube
from moviepy.editor import AudioFileClip
import pydub
import numpy as np
import os
import ujson as json

# the speed of video processing: ~80 videos per hour

# inspired by https://stackoverflow.com/questions/53633177/how-to-read-a-mp3-audio-file-into-a-numpy-array-save-a-numpy-array-to-mp3
# we can also try y, sr = librosa.load(mp3_file) and see which one is more efficient (test if normalizing affects the final mel-spectrogram)
def mp3tonumpy(f):
  a = pydub.AudioSegment.from_mp3(f).set_channels(1)
  y = np.array(a.get_array_of_samples())
  return np.float32(y) / 2**15

with open("./videos/videos_afrikaans.json", 'r') as h: # here we have split the videos.json file to prevent the disk from being occupied entirely
  video_dict = json.load(h)

for lang in video_dict:
  empty_dict = {
      "data": []
  }
  count = 0
  for video in video_dict[lang]:
    # search for a certain youtube video and download it
    video = f"https://www.youtube.com/watch?v={video}"
    obj = YouTube(video)
    obj.streams.filter(only_audio=True, file_extension="mp4").order_by('abr').desc().last().download()

    mp4_file = None
    mp3_file = None

    for f in os.listdir():
      if f.endswith(".mp4"):
        mp4_file = f
        mp3_file = f.replace(".mp4", ".mp3")

    # convert .mp4 to .mp3
    audioclip = AudioFileClip(mp4_file)
    audioclip.write_audiofile(mp3_file)
    audioclip.close()

    # remove the original .mp4 file
    os.remove(mp4_file)

    # convert .mp3 file to numpy array
    x = mp3tonumpy(mp3_file)

    # we can't save a big numpy array as one element of csv file, as it will be treated as a string and all center elements will be replaced by '...'
    # even if we try to store them as a row, it's difficult to handle data with different length when HuggingFace reads csv files
    # so we use json file to store our numpy arrays (we still need to convert them to a list as json can't encode numpy arrays)
    data = {}
    data["audio"] = x.tolist()
    empty_dict["data"].append(data)

    os.remove(mp3_file)
    if not os.path.exists("./data"):
      os.mkdir("./data")

    # save the file every 5 videos
    count += 1
    if count % 5 == 0:
      with open(f"./data/{lang}_{count // 5 - 1}.json", 'w') as f:
        json.dump(empty_dict, f)
      index = count // 5 - 1
      !gzip ./data/{lang}_{index}.json
      empty_dict = {
          "data": []
      }
  if empty_dict["data"] != []:
    with open(f"./data/{lang}_{count // 5}.json", 'w') as f:
      json.dump(empty_dict, f)
    index = count // 5
    !gzip ./data/{lang}_{index}.json

In [None]:
# on colab, we upload to HuggingFace Hub in order not to lose progress
from huggingface_hub import notebook_login, create_repo, Repository

notebook_login()

In [None]:
repo_url = create_repo(name="youtube-multilingual-child-stories-44.1kHz", repo_type="dataset")
repo = Repository(local_dir="youtube-multilingual-child-stories-44.1kHz", clone_from=repo_url)
!cp -r ./data/ youtube-multilingual-child-stories-44.1kHz/
repo.push_to_hub(private=True)

In [None]:
!pip install datasets

In [None]:
import librosa
import numpy as np

def transform(examples):
  processed_data = np.zeros((0, 128))
  for data in examples["audio"]:
    mel_spectrogram = librosa.feature.melspectrogram(y=np.asarray(data), sr=44100, n_mels=128)
    processed_data = np.concatenate((processed_data, mel_spectrogram.T), axis=0)
  return {"data": processed_data}

In [None]:
from datasets import load_dataset

raw_dataset = load_dataset("sunhaozhepy/youtube-multilingual-child-stories-44.1kHz", data_files="data/afrikaans_0.json.gz", field="data", split="train") # on colab do this
# we can use data_files="data/afrikaans_*.json.gz" to load every json.gz file
# raw_dataset = load_dataset("json", data_files="data/afrikaans_0.json.gz", field="data", split="train") # locally do this

In [None]:
dataset = raw_dataset.map(transform, remove_columns=raw_dataset.column_names, batched=True, batch_size=64)
dataset.set_format("torch")
print(len(dataset))

split_dataset = dataset.train_test_split(test_size=0.1)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

train_dataloader = DataLoader(split_dataset["train"], batch_size=64, shuffle=True)
test_dataloader = DataLoader(split_dataset["test"], batch_size=64, shuffle=False)

## 3. Training models for different languages

Since this task is unsupervised representation learning, we implement three common models of this kind: PCA, a linear autoencoder, and a deep autoencoder.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

### 3.1 PCA

PCA is normally fit offline by matrix factorization, but to scale up we need an online training procedure: we sample a mini-batch from the dataset and take a stochastic gradient (SGD) step on it. This problem is known in the literature as streaming PCA, or streaming k-PCA, where k is the number of leading eigenvectors of the data covariance matrix.

To do so, we use Oja's algorithm (see the paper for details).
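
As an illustration, here is a minimal sketch of one common mini-batch form of Oja's update, with QR re-orthonormalization after each step. It is not necessarily the exact variant intended here. The synthetic data, the learning rate, and the helper name `oja_step` are assumptions, and the data are taken to be zero-mean.

```python
import numpy as np


def oja_step(W, batch, lr):
    """One streaming k-PCA update on a (batch_size, d) mini-batch; W is (d, k)."""
    grad = batch.T @ (batch @ W) / len(batch)  # stochastic estimate of C @ W
    Q, _ = np.linalg.qr(W + lr * grad)         # re-orthonormalize the columns
    return Q


rng = np.random.default_rng(0)
d, k = 16, 2
# Synthetic zero-mean data whose two dominant directions are axes 0 and 1.
scales = np.full(d, 0.1)
scales[:2] = [3.0, 2.0]

W, _ = np.linalg.qr(rng.normal(size=(d, k)))  # random orthonormal start
for _ in range(2000):
    batch = rng.normal(size=(64, d)) * scales
    W = oja_step(W, batch, lr=0.01)

# Share of each learned column's energy that lies in the true top-2 subspace.
print(np.sum(W[:2] ** 2, axis=0))
```

Values close to 1 indicate that the columns of `W` span the leading eigenspace of the covariance matrix, even though the full covariance matrix was never formed.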

### 3.2 Linear Autoencoder

Autoencoders are artificial neural networks that consist of an encoder and a decoder: the encoder projects the input to a feature space, and the decoder projects the encoding back to the input space with the least reconstruction error possible.

For comparison with PCA, we first implement a linear autoencoder with one hidden layer. It is equivalent to a PCA model without the orthogonality/norm constraint on the weights.
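
The missing constraint can be made concrete with a small NumPy check. This is a sketch on synthetic data with no bias terms; the mixing matrix `A` is an arbitrary invertible matrix chosen for illustration. Re-mixing the PCA code with `A` in the encoder and undoing it in the decoder gives exactly the PCA reconstruction, although the encoder rows are no longer orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
X -= X.mean(axis=0)  # PCA assumes centered data
k = 2

# PCA solution: the top-k right singular vectors, as orthonormal rows.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k]                       # shape (k, d)
pca_recon = X @ V_k.T @ V_k

# A linear autoencoder may learn any invertible re-mixing A of the PCA code.
A = np.array([[2.0, 1.0], [0.5, 3.0]])
W_enc = A @ V_k                    # shape (k, d), rows not orthonormal
W_dec = V_k.T @ np.linalg.inv(A)   # shape (d, k)
ae_recon = X @ W_enc.T @ W_dec.T

print(np.allclose(pca_recon, ae_recon))
```

Both models reach the same reconstruction, so the linear autoencoder recovers the PCA subspace but not necessarily the individual principal directions.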

In [None]:
class LinearAutoEncoder(nn.Module):
  def __init__(self):
    super().__init__()
    self.encoder = nn.Linear(128, 8)
    self.decoder = nn.Linear(8, 128)

  def forward(self, x):
    encoding = self.encoder(x)
    reconstructed_x = self.decoder(encoding)
    return encoding, reconstructed_x

### 3.3 Deep Autoencoder

As above, we train a full autoencoder, this time with a deep encoder-decoder architecture and nonlinear activation layers.

In [None]:
class AutoEncoder(nn.Module):
  def __init__(self):
    super().__init__()
    self.encoder = nn.Sequential(
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.Linear(32, 16),
        nn.ReLU(),
        nn.Linear(16, 8)
    )
    self.decoder = nn.Sequential(
        nn.Linear(8, 16),
        nn.ReLU(),
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 64),
        nn.ReLU(),
        nn.Linear(64, 128),
    )

  def forward(self, x):
    encoding = self.encoder(x)
    reconstructed_x = self.decoder(encoding)
    return encoding, reconstructed_x

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"using {device}.")

model = AutoEncoder().to(device)
  
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)

step = 0
best_val_loss = float("inf")
train_loss = 0
for batch in train_dataloader:
  model.train()
  batch = batch["data"].to(device) 
  optimizer.zero_grad()
  _, outputs = model(batch)
  loss = loss_fn(outputs, batch)
  train_loss += loss.item()  # accumulate a float so the computation graph is not kept alive
  loss.backward()
  optimizer.step()
  step += 1

  if step % 50 == 0:
    print(f"step {step}, training loss = {train_loss / 50}")
    train_loss = 0

  if step % 500 == 0:
    print("validation loop...")
    eval_loss = 0
    model.eval()
    for eval_batch in test_dataloader:
      with torch.no_grad():
        eval_batch = eval_batch["data"].to(device) 
        _, outputs = model(eval_batch)
        loss = loss_fn(outputs, eval_batch)
        eval_loss += loss.item()
    eval_loss /= len(test_dataloader)
    print(f"validation loss = {eval_loss}")
    if eval_loss < best_val_loss:
      print("Saving checkpoint!")
      best_val_loss = eval_loss
      torch.save(model.state_dict(), f'/content/drive/My Drive/Colab Notebooks/checkpoints/autoencoder_afrikaans_best_file_0.pt') # model_language_"best"_dataset

## 4. Storing results and anonymization

We keep on disk only a list of video URLs and a hash of each WAV file, along with the search parameters and the timestamp of the search. We can also keep the language ID (global, or coupled with VAD/diarization), the VAD output, anonymized speaker diarization, and so on.
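
A record of this kind could look like the following sketch. The field names, the choice of SHA-256, and the helper `manifest_entry` are assumptions made for illustration; the URL is a placeholder.

```python
import hashlib
import json
from datetime import datetime, timezone


def manifest_entry(video_url, wav_bytes, search_params):
    """Describe one training item without storing the audio itself."""
    return {
        "url": video_url,
        "sha256": hashlib.sha256(wav_bytes).hexdigest(),
        "search_params": search_params,
        "searched_at": datetime.now(timezone.utc).isoformat(),
    }


entry = manifest_entry(
    "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder URL
    b"\x00\x01\x02\x03",                          # stand-in for the WAV bytes
    {"q": "stories for children", "videoDuration": "medium"},
)
print(json.dumps(entry, indent=2))
```

The hash identifies the exact audio that was used without making it possible to recover the audio from the record.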

We publish the search procedure and the code, but do not keep any of the metadata, WAV data, or captions.
The rationale is to avoid any issue with people who would like public information about them to be removed. Scientifically, it means:

  1. we should check that results hold across variants of the training set;
  2. this mirrors experimental practice: you cannot test the same persons twice, but your conclusions should hold when someone reproduces the experiments with other people.
  
This means retraining all relevant models every time we want to compare them, if some video has become unavailable or its hash has changed.
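
The check that triggers retraining can be sketched as a comparison of two URL-to-hash mappings. The helper `changed_videos` and the toy mappings are hypothetical.

```python
def changed_videos(old_manifest, new_manifest):
    """Return URLs that disappeared or whose audio hash differs."""
    changed = []
    for url, old_hash in old_manifest.items():
        if new_manifest.get(url) != old_hash:
            changed.append(url)
    return sorted(changed)


old = {"url_a": "h1", "url_b": "h2", "url_c": "h3"}
new = {"url_a": "h1", "url_b": "h2-changed"}  # url_c became unavailable
stale = changed_videos(old, new)
print(stale, "-> retrain" if stale else "-> models remain comparable")
```

An empty result means the training set is unchanged and previously trained models can still be compared directly.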

One issue to keep in mind is that we don't want to use models where the training data could be reconstructed from the model parameters.
