## OyezDataPrep

This notebook is used exclusively for scraping and processing our real audio and TTS speech data. All of the deep learning happens in "FINAL - OyezTraining".

### References: 

NeMo:

*   https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/asr/01_ASR_with_NeMo.ipynb#scrollTo=7mP4r1Gx_Ilt
*   https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/tools/CTC_Segmentation_Tutorial.ipynb#scrollTo=hRFAl0gO92bp
*   https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/tts/1_TTS_inference.ipynb#scrollTo=-BB2-KokaP08
*   https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html

Please note our efforts using these notebooks have already improved NeMo:
*   https://github.com/NVIDIA/NeMo/issues/2217#issuecomment-841738358
*   https://github.com/NVIDIA/NeMo/issues/2208

Critically, we adapted an API from the Supreme Court diarization paper in our report:
* https://github.com/JeffT13/rd-diarization

Thanks to Project Oyez for all the data: 
* https://www.oyez.org/



Note that outputs are cleared to minimize clutter.

We load all the dependencies and scripts we'll need from NeMo.

In [None]:
BRANCH = 'r1.0.0rc1'
"""
You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.
# install NeMo
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
import json
import os
import wget

from IPython.display import Audio
import numpy as np
import scipy.io.wavfile as wav

! pip install pandas

# optional
! pip install plotly
from plotly import graph_objects as go

In [None]:
# If you're running the notebook locally, update the TOOLS_DIR path below
# In Colab, a few required scripts will be downloaded from NeMo github

import wget
TOOLS_DIR = '<UPDATE_PATH_TO_NeMo_root>/tools/ctc_segmentation/scripts'

if 'google.colab' in str(get_ipython()):
    TOOLS_DIR = 'scripts/'
    os.makedirs(TOOLS_DIR, exist_ok=True)

    required_files = ['prepare_data.py',
                    'normalization_helpers.py',
                    'run_ctc_segmentation.py',
                    'verify_segments.py',
                    'cut_audio.py',
                    'process_manifests.py',
                    'utils.py']
    for file in required_files:
        if not os.path.exists(os.path.join(TOOLS_DIR, file)):
            file_path = 'https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/tools/ctc_segmentation/' + TOOLS_DIR + file
            print(file_path)
            wget.download(file_path, TOOLS_DIR)
elif not os.path.exists(TOOLS_DIR):
    raise ValueError('update path to NeMo root directory')

### 1. Scrape and segment Oyez audio

In [None]:
## create data directory for the year we are scraping
WORK_DIR = 'WORK_DIR_2018'
DATA_DIR = WORK_DIR + '/DATA'
os.makedirs(DATA_DIR, exist_ok=True)

This version, adapted from the speaker diarization repository linked above, finds all cases for a year through the Oyez API and formats them so they can be run through NeMo's CTC segmentation script. Some cases do not have transcripts; for those the JSON lookup fails and we move on to the next case.
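
The transcript JSON nests its text under sections, turns, and text blocks, which is what the scraping cell indexes into. The following is a minimal sketch of that flattening step only. The `sample` dictionary is made up for illustration and mimics just the keys the notebook reads, and `flatten_transcript` is a hypothetical helper name, not part of the notebook or of the Oyez API.

```python
def flatten_transcript(transcript_json):
    """Join every text block into one string, one block per line."""
    lines = []
    for section in transcript_json["transcript"]["sections"]:
        for turn in section["turns"]:
            for block in turn["text_blocks"]:
                # Drop the period in "v." (as in case names), as the notebook does
                lines.append(block["text"].replace("v.", "v"))
    return "".join(line + "\n" for line in lines)


# Made-up sample that only mimics the nesting used in this notebook
sample = {
    "transcript": {
        "sections": [
            {
                "turns": [
                    {"text_blocks": [{"text": "This is Smith v. Jones."},
                                     {"text": "The judgment is affirmed."}]},
                    {"text_blocks": [{"text": "Thank you."}]},
                ]
            }
        ]
    }
}

text = flatten_transcript(sample)
print(text)
```

Writing one text block per line keeps the transcript in a plain-text form that can sit next to the audio file under the same base name.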

In [None]:
from datetime import date
import traceback

import json

from urllib.request import urlopen

import requests
#from ratelimit import limits, sleep_and_retry

YEARS_TO_GO_BACK = 1


#

def get_http_json(url):
    print(f"Getting {url}")
    response = requests.get(url)
    parsed = response.json()
    return parsed

# This is the main function: it downloads the opinion announcement audio for a case
# and writes the matching transcript next to it as a .txt file with the same base name
def get_case(term, docket):
    """Download the opinion announcement audio for a case and
    write its transcript text to DATA_DIR"""
    url = f"https://api.oyez.org/cases/{term}/{docket}"
    docket_data = get_http_json(url)
    # check that the key exists before using it
    opinion_announcement = docket_data.get("opinion_announcement")
    if not opinion_announcement:
        # no opinion announcement for this case, so there is nothing to download
        return
    print(opinion_announcement)
    # the first announcement links to a JSON with both the audio link and the transcript
    audio_json = get_http_json(opinion_announcement[0]["href"])
    audio_link = audio_json["media_file"][0]["href"]
    audio_file = audio_link[audio_link.rfind("/")+1:]
    wget.download(audio_link, DATA_DIR)
    # concatenate every text block of every turn, one block per line
    full_text = ""
    for section in audio_json['transcript']['sections']:
        for turn in section['turns']:
            for block in turn['text_blocks']:
                # drop the period in "v." (as in case names)
                full_text += block['text'].replace('v.', 'v') + "\n"
    with open(os.path.join(DATA_DIR, audio_file.replace('mp3', 'txt')), 'w') as f:
        f.write(full_text)


def write_case(term, docket, docket_data, transcripts):
    """
    Writes term-docket.json file with docket_data
    For each transcript, writes the term-docket-t##.json file
    """
    with open(f"oyez/cases/{term}.{docket}.json", "w") as docket_file:
        json.dump(docket_data, docket_file, indent=2)

    count = 0
    for t in transcripts:
        count += 1
        t_filename = "oyez/cases/{}.{}-t{:0>2d}.json".format(term, docket, count)
        with open(t_filename, "w") as t_file:
            json.dump(t, t_file, indent=2)


# this calls get_case for every case in the dict
def fetch_missing(cases):
    """
    cases is a map of tuples to Summary (term, docket) : {SUMMARY}
    For each case, fetch the audio and transcript data and write them to files

    return set of cases that were processed without an error
    """
    count = 0
    total = len(cases)
    successful = set()
    for term, docket in cases.keys():
        ## pull the file
        count += 1
        print(f"Trying: {term}/{docket}\t\t{count}/{total}")
        try:
            # get_case writes its files itself and returns nothing
            get_case(term, docket)
            successful.add((term, docket))
        except Exception:
            # e.g. the case has no transcript yet
            traceback.print_exc()
            print(f"Failed for {term}/{docket}, continuing anyway")
    return successful

# this builds a dict with all the cases for the year specified
def find_missing(years):
    """
    Fetch all summaries for given years and find any that are
    missing in the local "known_map"
    """
    to_fetch = {}
    for year in years:
        summary_url = f"https://api.oyez.org/cases?per_page=0&filter=term:{year}"
        summaries = get_http_json(summary_url)
        for summary in summaries:
          to_fetch[(summary["term"], summary["docket_number"])] = summary

    return to_fetch

# this specifies the last year to scrape and how far back to go, for example our test set
# would end at 2018 and include 2017, so we'd look back 2 years
def years_to_recheck():
    """
    Makes a list of YEARS_TO_GO_BACK years
    ending at cur_year
    e.g. [2017, 2018] when YEARS_TO_GO_BACK = 2
    """
    cur_year = 2018
    return list(range(cur_year - YEARS_TO_GO_BACK + 1, cur_year + 1))

# calls the functions above to download all audio and transcripts for 2018
def main():
    """
    Find all cases for the years to recheck and fetch
    the audio and transcripts for them.
    """
    missing_summaries = find_missing(years_to_recheck())

    print(f"Missing {len(missing_summaries)} cases")
    print(missing_summaries.keys())

    fetch_missing(missing_summaries)

if __name__ == "__main__":
    main()

We create the folders needed for CTC segmentation and download NeMo's segmentation script

In [None]:
! mkdir $DATA_DIR/facts_data
! mkdir $DATA_DIR/text
! mkdir $DATA_DIR/audio

In [None]:
if 'google.colab' in str(get_ipython()) and not os.path.exists('run_sample.sh'):
    wget.download('https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/tools/ctc_segmentation/run_sample.sh', '.')

This script segments all case audio based on the transcript and creates the manifests used for training and testing. One year of opinions takes ~3 hours to run.

In [None]:
MODEL = 'QuartzNet15x5Base-En'
OFFSET = 0
THRESHOLD = -5
if 'google.colab' in str(get_ipython()):
    OUTPUT_DIR_2 = f'/content/{WORK_DIR}/output_multiple_files'
else:
    OUTPUT_DIR_2 = os.path.join(WORK_DIR, 'output_multiple_files')

! bash $TOOLS_DIR/../run_sample.sh \
--MODEL_NAME_OR_PATH=$MODEL \
--DATA_DIR=$DATA_DIR \
--OUTPUT_DIR=$OUTPUT_DIR_2 \
--SCRIPTS_DIR=$TOOLS_DIR \
--CUT_PREFIX=0 \
--MIN_SCORE=$THRESHOLD  \
--USE_NEMO_NORMALIZATION=False

Once we have run the above for each year from 2018 back to 2000, we can create a single manifest for our dev set (2015 and 2016) or our val set (2017 and 2018). We found it easiest to manually move files over to Drive.

Note that we exclude examples longer than 16 seconds, per the QuartzNet paper, to avoid CUDA memory errors.
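
The duration filter can be sketched on its own by parsing each manifest line as JSON and reading the `duration` field directly. This is an illustrative sketch, not the notebook's cell: `filter_manifest` is a hypothetical helper, the three sample rows are made up, it works on in-memory text instead of the Drive paths, and it leaves out the check that each audio file exists.

```python
import io
import json

MAX_DURATION = 16  # seconds, the cutoff used in this notebook


def filter_manifest(fin, fout, max_duration=MAX_DURATION):
    """Copy manifest lines whose duration is at most max_duration.

    Returns the number of lines kept.
    """
    kept = 0
    for line in fin:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        if float(entry["duration"]) <= max_duration:
            fout.write(json.dumps(entry) + "\n")
            kept += 1
    return kept


# Made-up rows in the manifest format (audio_filepath, duration, text)
rows = [
    {"audio_filepath": "a.wav", "duration": 4.2, "text": "short clip"},
    {"audio_filepath": "b.wav", "duration": 21.0, "text": "too long"},
    {"audio_filepath": "c.wav", "duration": 16.0, "text": "right at the cutoff"},
]
source = io.StringIO("".join(json.dumps(row) + "\n" for row in rows))
target = io.StringIO()
kept = filter_manifest(source, target)
print(kept)
```

Reading the parsed `duration` value avoids depending on the exact character layout of each line, which string slicing would.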

In [None]:
test_manifest_full_final = '/content/drive/MyDrive/Colab Notebooks/paired_final/test_manifest_full_final.json'
# create a new manifest with ONLY files of 16 seconds or less from the yearly manifests
with open(test_manifest_full_final, 'w') as fout:
  # dev set years (2015 and 2016); use range(2017, 2019) for the val set
  for year in range(2015, 2017):
    with open('/content/drive/MyDrive/Colab Notebooks/WORK_DIR_'+str(year)+'/output_multiple_files/all_manifest.json', 'r') as fin:
      for line in fin:
        # fix filepath
        line = line.replace("content/","content/drive/MyDrive/Colab Notebooks/")
        json_line = json.loads(line)
        file_to_check = json_line['audio_filepath']
        # make sure the file exists and is short enough
        if os.path.exists(file_to_check) and float(json_line['duration']) <= 16:
          fout.write(line)

Similarly, we can create a training set with all of our real paired data by executing the below.

Please note that later models were trained with just one or two years of paired data, so to make those manifests we would run the below with the year list set to [2014] or [2013,2014].

In [None]:
train_manifest_full_final = '/content/drive/MyDrive/Colab Notebooks/paired_final/train_manifest_full_final.json'
# create a new manifest with ONLY files of 16 seconds or less from the yearly training manifests
with open(train_manifest_full_final, 'w') as fout:
  for year in [2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014]:
    with open('/content/drive/MyDrive/Colab Notebooks/WORK_DIR_'+str(year)+'/output_multiple_files/all_manifest.json', 'r') as fin:
      for line in fin:
        # fix filepath
        line = line.replace("content/","content/drive/MyDrive/Colab Notebooks/")
        json_line = json.loads(line)
        file_to_check = json_line['audio_filepath']
        # make sure the file exists and is short enough
        if os.path.exists(file_to_check) and float(json_line['duration']) <= 16:
          fout.write(line)

### 2. Scrape and format Oyez Hot Text


Now that we have our real audio, we need to download Oyez's accompanying "facts of the case", which will be our "hot text" (HT). The below will grab all hot text for 2017 and 2018. Note that for our dev set models we would train on 2015 and 2016 hot text, and for our val set models we would grab 2017 and 2018 hot text, so that the text overlaps heavily with the vocabulary of the real test set.
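
The scraper picks its years from an end year and a look-back count. As a small sketch of that arithmetic, `years_to_fetch` below is a hypothetical stand-in that takes both values as arguments; the end years 2018 and 2016 are taken from the dev and val set descriptions in this notebook.

```python
def years_to_fetch(last_year, years_back):
    """Return the years_back consecutive years that end at last_year."""
    return list(range(last_year - years_back + 1, last_year + 1))


val_years = years_to_fetch(2018, 2)  # hot text for the val set models
dev_years = years_to_fetch(2016, 2)  # hot text for the dev set models
print(val_years, dev_years)
```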

In [None]:
## create data directory for the hot text, with the subfolders used below
WORK_DIR = 'WORK_DIR_2018_HT_nemo'
DATA_DIR = WORK_DIR + '/DATA'
for subdir in ['facts_data', 'text', 'audio']:
    os.makedirs(os.path.join(DATA_DIR, subdir), exist_ok=True)

We modify the same functions as above, adapted from the speaker diarization repository. Stack Overflow references for the regular expressions:
* https://stackoverflow.com/questions/3398852/using-python-remove-html-tags-formatting-from-a-string
* https://stackoverflow.com/questions/40196941/regex-to-remove-periods-in-acronyms/40197005
* https://stackoverflow.com/questions/47625950/split-string-by-spaces-into-substrings-with-max-length-in-python/47626150
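
The text cleaning chains a few small steps: remove periods in acronyms, strip HTML tags, drop remaining punctuation, remove periods after common abbreviations, split into sentences, and wrap long sentences at 100 characters. The sketch below puts those steps in one place, assuming the same regular expressions as the scraping cell. `clean_hot_text` is a hypothetical helper name and the sample sentence is made up.

```python
import re
import textwrap


def clean_hot_text(raw, width=100):
    """Return cleaned, wrapped lines from one HTML text field."""
    # Remove the period after a standalone capital letter (U.S. -> US)
    text = re.sub(r'(?<!\w)([A-Z])\.', r'\1', raw)
    # Strip HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Keep only letters, digits, hyphens, spaces, question marks, and periods
    text = re.sub(r'[^A-Za-z0-9\- ?.]+', '', text)
    # Remove the period after common abbreviations
    for abbrev in ('v.', 'Mr.', 'Mrs.', 'Ms.'):
        text = text.replace(abbrev, abbrev[:-1])
    # Treat question marks as sentence ends, then split into sentences
    text = text.replace('?', '.')
    lines = []
    for sentence in text.split('.'):
        # Empty sentences wrap to an empty list and are dropped
        lines.extend(textwrap.wrap(sentence.strip(), width, break_long_words=False))
    return lines


sample = "<p>In Smith v. Jones, the U.S. agency denied Mr. Smith's claim. Was that lawful?</p>"
cleaned = clean_hot_text(sample)
for line in cleaned:
    print(line)
```

The acronym step has to run before the sentence split, otherwise "U.S." would be cut into separate one-letter sentences.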

In [None]:
# FOR TEXT SCRAPING

from datetime import date
import traceback
import os
import json
import re
from urllib.request import urlopen
import textwrap

import requests
#from ratelimit import limits, sleep_and_retry

YEARS_TO_GO_BACK = 2

def get_http_json(url):
    print(f"Getting {url}")
    response = requests.get(url)
    parsed = response.json()
    return parsed

def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

# Clean one text field of a case: remove periods in acronyms, HTML tags, punctuation,
# and periods after common abbreviations, then split the text into sentences
def clean_and_split(text):
    text = re.sub(r'(?<!\w)([A-Z])\.', r'\1', text)
    text = striphtml(text)
    text = re.sub('[^A-Za-z0-9- ?.]+', '', text)
    for abbrev in ['v.', 'Mr.', 'Mrs.', 'Ms.']:
        text = text.replace(abbrev, abbrev[:-1])
    # question marks also end a sentence
    text = text.replace("?", ".")
    return text.split(".")

# Now this function is re-written to download the "facts of the case", the question,
# and the conclusion from Oyez and write them to a single .txt file
def get_case(term, docket):
    """Get the info of the case and write its cleaned
    facts, question, and conclusion to a text file"""
    url = f"https://api.oyez.org/cases/{term}/{docket}"
    docket_data = get_http_json(url)
    # clean everything first, so a case with a missing field fails before any file is written
    fields = ["facts_of_the_case", "question", "conclusion"]
    cleaned = [clean_and_split(docket_data[field]) for field in fields]
    print(cleaned[0])
    docket_number = docket_data["docket_number"]
    # We save all the text to a text file for this case. We also wrap text at 100 characters
    # to avoid errors with audio that gets cut off
    with open(os.path.join(DATA_DIR, 'facts_data', docket_number + '.txt'), 'w') as f:
        for sentences in cleaned:
            for sentence in sentences:
                for single_line in textwrap.wrap(sentence.lstrip(), 100, break_long_words=False):
                    f.write(single_line + '\n')


def write_case(term, docket, docket_data, transcripts):
    """
    Writes term-docket.json file with docket_data
    For each transcript, writes the term-docket-t##.json file
    """
    with open(f"oyez/cases/{term}.{docket}.json", "w") as docket_file:
        json.dump(docket_data, docket_file, indent=2)

    count = 0
    for t in transcripts:
        count += 1
        t_filename = "oyez/cases/{}.{}-t{:0>2d}.json".format(term, docket, count)
        with open(t_filename, "w") as t_file:
            json.dump(t, t_file, indent=2)

# this calls get_case for every case in the dict
def fetch_missing(cases):
    """
    cases is a map of tuples to Summary (term, docket) : {SUMMARY}
    For each case, fetch the hot text and write it to a file

    return set of cases that were processed without an error
    """
    count = 0
    total = len(cases)
    successful = set()
    for term, docket in cases.keys():
        ## pull the file
        count += 1
        print(f"Trying: {term}/{docket}\t\t{count}/{total}")
        try:
            # get_case writes its file itself and returns nothing
            get_case(term, docket)
            successful.add((term, docket))
        except Exception:
            # e.g. one of the text fields is missing for this case
            traceback.print_exc()
            print(f"Failed for {term}/{docket}, continuing anyway")
    return successful

# this builds a dict of all cases in the year specified
def find_missing(years):
    """
    Fetch all summaries for given years and find any that are
    missing in the local "known_map"
    """
    to_fetch = {}
    for year in years:
        summary_url = f"https://api.oyez.org/cases?per_page=0&filter=term:{year}"
        summaries = get_http_json(summary_url)
        for summary in summaries:
          to_fetch[(summary["term"], summary["docket_number"])] = summary

    return to_fetch


def years_to_recheck():
    """
    Makes a list of YEARS_TO_GO_BACK years
    ending at cur_year
    e.g. [2017, 2018] when YEARS_TO_GO_BACK = 2
    """
    cur_year = 2018
    return list(range(cur_year - YEARS_TO_GO_BACK + 1, cur_year + 1))

# runs the above functions to download all "Facts of the case" and write as .txt files
def main():
    """
    Find all cases for the years to recheck and fetch
    the hot text for them.
    """
    missing_summaries = find_missing(years_to_recheck())

    print(f"Missing {len(missing_summaries)} cases")
    print(missing_summaries.keys())

    fetch_missing(missing_summaries)

if __name__ == "__main__":
    main()

### 3. Synthesize TTS from hot text

Now we load NVIDIA's tts_en_tacotron2 model as our spectrogram generator, and NVIDIA's tts_waveglow_88m vocoder model, which converts spectrograms to audio.

In [None]:
import soundfile as sf
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

# Download and load the pretrained tacotron2 model
spec_gen = SpectrogramGenerator.from_pretrained("tts_en_tacotron2")
# Download and load the pretrained waveglow model
vocoder = Vocoder.from_pretrained("tts_waveglow_88m")



For every line in our text data, we synthesize speech. Note that the tacotron model first generates a spectrogram from the text, which the vocoder turns into audio.

Note we also uncovered a bug in soundfile, which we submitted on GitHub: https://github.com/bastibe/python-soundfile/issues/203

The below is adapted from: https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/tts/1_TTS_inference.ipynb#scrollTo=-BB2-KokaP08

Stackoverflow cites:
* https://stackoverflow.com/questions/7099290/how-to-ignore-hidden-files-using-os-listdir

In [None]:
from IPython.display import Audio #Import Audio method from IPython's Display Class
# read the hot text from the current DATA_DIR, and write the per-line text and audio there too
facts_dir = os.path.join(DATA_DIR, 'facts_data')
file_list = [f for f in os.listdir(facts_dir) if not f.startswith('.')]
print(file_list)
for file in file_list:
  counter=1
  with open(os.path.join(facts_dir, file), 'r') as f:
    save_name = file[0:-4]
    for line in f:
      if len(line)>3:
        with open(os.path.join(DATA_DIR, 'text', save_name+'-'+str(counter)+'.txt'), 'w') as fout:
          fout.write(line)
        # the spectrogram generator generates a spectrogram from the line of text
        parsed = spec_gen.parse(line)
        spectrogram = spec_gen.generate_spectrogram(tokens=parsed)
        # the vocoder converts this to audio
        audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
        audio = audio.to('cpu').numpy()
        # soundfile expects (frames, channels), so we transpose the batch-first array
        audio2 = audio.transpose()
        sf.write(os.path.join(DATA_DIR, 'audio', save_name+'-'+str(counter)+'.wav'), audio2, 22050)
        counter=counter+1

We move all our text and audio files to Google Drive for safekeeping

In [None]:
!mv '/content/WORK_DIR_2014_HT_nemo' '/content/drive/MyDrive/Colab Notebooks'

Adapted from: https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/asr/01_ASR_with_NeMo.ipynb#scrollTo=7mP4r1Gx_Ilt

We now build a manifest with the "hot text" to match the format that NeMo needs.

As written, the below code creates the "hot text" training manifest for our val model (2017 and 2018 data).

To build the manifest for training against our dev set, we simply change the year to 2016 (that folder contains the hot text for 2015 and 2016).
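
Each manifest line is one JSON object with the audio path, the duration in seconds, and the lower-cased transcript. The sketch below builds one such line using only the standard library. It swaps in Python's `wave` module for librosa to read the duration, which is an assumption that only holds for uncompressed PCM WAV files. `wav_duration` and `manifest_entry` are hypothetical helper names, and the demo file is half a second of generated silence.

```python
import json
import os
import tempfile
import wave


def wav_duration(path):
    """Duration in seconds of a PCM WAV file, from frame count and sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def manifest_entry(audio_path, transcript):
    """Build one manifest line as a JSON string."""
    return json.dumps({
        "audio_filepath": audio_path,
        "duration": wav_duration(audio_path),
        # strip the trailing newline here instead of in a later clean-up pass
        "text": transcript.strip().lower(),
    })


# Demo: write half a second of silence at 22050 Hz, then describe it
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-1.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(22050)
        wav_file.writeframes(b"\x00\x00" * 11025)
    entry = json.loads(manifest_entry(demo_path, "The judgment is affirmed\n"))

print(entry["duration"], entry["text"])
```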


In [None]:
# --- Building Manifest Files --- #
import librosa
import json
import os

# Function to build a manifest
def build_manifest(transcripts_path, manifest_path):
    with open(manifest_path, 'w') as fout:
        for year in [2018]:
            # we go through every year in the HT folders and grab all the transcripts
            transcripts_path = transcripts_path.replace("2018", str(year))
            file_list = [f for f in os.listdir(transcripts_path) if not f.startswith('.')]
            for t_file in file_list:
                # we read in each line from the transcript, make it lower case
                # we grab the audiofile path
                # we grab the duration
                # put all those together into one line in the manifest
                # then we go to the next transcript
                with open(transcripts_path + t_file, 'r') as fin:
                    for line in fin:
                        print(line)
                        transcript = line.lower()
                        audio_file = t_file.replace(".txt", ".wav")
                        audio_path = transcripts_path.replace("text", "audio") + audio_file
                        duration = librosa.core.get_duration(filename=audio_path)

                        # Write the metadata to the manifest
                        metadata = {
                            "audio_filepath": audio_path,
                            "duration": duration,
                            "text": transcript
                        }
                        json.dump(metadata, fout)
                        fout.write('\n')
                
# Building Manifests
print("******")
path_transcripts = '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2018_HT_nemo/DATA/text/'
path_manifest = '/content/drive/MyDrive/Colab Notebooks/TTS_paths/train_final.json'
build_manifest(path_transcripts, path_manifest)
print("Training manifest created.")


Finally, we process the manifest to make sure it doesn't have extraneous text in the text line, and this creates a nice backup copy just in case something happens!



In [None]:
import json
with open('/content/drive/MyDrive/Colab Notebooks/TTS_manifests/train_for_val.json','r') as fin:
  with open('/content/drive/MyDrive/Colab Notebooks/TTS_manifests/train_for_val_corrected.json','w') as fout:
    for line in fin:
      json_line = json.loads(line)
      # we eliminate any extraneous end-of-line characters from our transcripts
      json_line["text"] = json_line["text"].replace("\n","")
      json.dump(json_line,fout)
      fout.write("\n")

Please note that to create the mixed training manifests (e.g., 2 years of TTS and 1 year of paired data), we would manually copy and paste two manifests together, using the above code to get the years right.
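
Because each manifest holds one JSON object per line, combining two of them amounts to concatenating their lines. The sketch below shows a scripted alternative to copying and pasting. `merge_manifests` is a hypothetical helper, and the file names and entries are made up and written to a temporary folder.

```python
import json
import os
import tempfile


def merge_manifests(input_paths, output_path):
    """Concatenate JSON-lines manifests; return the number of entries written."""
    written = 0
    with open(output_path, "w") as fout:
        for path in input_paths:
            with open(path, "r") as fin:
                for line in fin:
                    if not line.strip():
                        continue  # skip blank lines
                    json.loads(line)  # raises ValueError on a malformed line
                    # normalize the line ending so entries never run together
                    fout.write(line.rstrip("\n") + "\n")
                    written += 1
    return written


with tempfile.TemporaryDirectory() as tmp_dir:
    tts_path = os.path.join(tmp_dir, "tts.json")
    paired_path = os.path.join(tmp_dir, "paired.json")
    mixed_path = os.path.join(tmp_dir, "mixed.json")
    with open(tts_path, "w") as f:
        f.write(json.dumps({"audio_filepath": "tts-1.wav", "duration": 3.0, "text": "synthetic one"}) + "\n")
        f.write(json.dumps({"audio_filepath": "tts-2.wav", "duration": 5.5, "text": "synthetic two"}) + "\n")
    with open(paired_path, "w") as f:
        # no trailing newline on purpose, as can happen after manual edits
        f.write(json.dumps({"audio_filepath": "real-1.wav", "duration": 7.1, "text": "real audio"}))
    total = merge_manifests([tts_path, paired_path], mixed_path)
    with open(mixed_path) as f:
        mixed = [json.loads(line) for line in f]

print(total)
```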

**Legacy code:**  we experimented with Google's TTS model as well, but found it had a slight accent and did not perform well as training data. Code is left below for completeness, but no Google TTS data appears in our final report

In [None]:
import time
from gtts import gTTS #Import Google Text to Speech
from IPython.display import Audio #Import Audio method from IPython's Display Class
file_list = [f for f in os.listdir('/content/WORK_DIR_2014_HT_nemo/DATA/facts_data') if not f.startswith('.')] 
print(file_list)
for file in file_list:
  counter=1
  with open(os.path.join(DATA_DIR+'/facts_data/'+file), 'r') as f:
    save_name = file[0:-4]
    for line in f:
      print(save_name)
      print(f)
      print(len(line))
      if len(line)>3:
        with open(os.path.join(DATA_DIR+'/text/'+save_name+str(counter)+'.txt'), 'w') as fout:
          fout.write(line)
        time.sleep(1.5)
        tts = gTTS(line) #Provide the string to convert to speech
        # note that gTTS writes MP3 data, even though the file name keeps the .wav extension
        tts.save('/content/WORK_DIR_2014_HT_nemo/DATA/audio/'+save_name+str(counter)+'.wav')
        counter=counter+1