## Packages

In [1]:
import os.path as osp
from pathlib import Path
from time import time

import numpy as np
import pandas as pd
import librosa

from IPython.display import Audio

## Arguments & User Defined Functions

In [18]:
min_words = 40
wavs_dir = "../wavs2/"
target_sr = 16000
transcripts_path = "../outputs/all_transcripts_v2.csv"
transcripts = pd.read_csv(transcripts_path)
print(transcripts.shape)

(111714, 9)


In [11]:
def play_audio(signal, rate):
    return Audio(data=signal, rate=rate)

## Collect All Transcripts

In [12]:
wavs = sorted(Path(wavs_dir).rglob("*.wav"))  # sorted for a deterministic file order
print("WAV Files:", len(wavs))

WAV Files: 488


In [19]:
data = (
    transcripts.loc[
        (transcripts["word_count"] >= min_words)
        & (transcripts["speaker_role"] == "scotus_justice")
    ]
    .copy()
    .reset_index(drop=True)
)

print(data.shape)
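# Convert start/end times (seconds) to sample indices at target_sr.
# These indices only line up with audio loaded at that same sample rate.
data["start_idx"] = np.floor(data["start"] * target_sr).astype(int)
data["end_idx"] = np.ceil(data["end"] * target_sr).astype(int)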

data.to_csv("../outputs/data_transcripts_v2.csv", index=False)

(17138, 9)
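The cell above turns start and end times into sample indices by flooring the start and ceiling the end at `target_sr`. The sketch below illustrates why those indices are tied to one sample rate: the same timestamps map to different positions at 16 kHz and 44.1 kHz. The helper name `seconds_to_samples` and the example timestamps are hypothetical, chosen only for illustration.

```python
import math


def seconds_to_samples(start_s, end_s, sr):
    """Floor the start and ceil the end so the slice never clips the utterance."""
    return math.floor(start_s * sr), math.ceil(end_s * sr)


# Hypothetical timestamps, not taken from the transcripts
idx_16k = seconds_to_samples(1.5, 2.25, 16000)
idx_44k = seconds_to_samples(1.5, 2.25, 44100)

print(idx_16k)  # (24000, 36000)
print(idx_44k)  # (66150, 99225)
```

Indices computed at one rate and applied to audio loaded at another would select the wrong part of the recording, so the rate used here should match the rate used when loading.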


In [23]:
# st = time()
# all_data = []
# for i, w in enumerate(wavs):
#     wav_file, wav_sr = librosa.load(w, sr=librosa.core.get_samplerate(w))
#     df = pd.read_json(osp.join(w.parent, w.name.replace(".wav", ".json")))
#     df["file"] = w.name
#     df["line"] = df.index

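#     # Sample rate and channel count (librosa.load downmixes to mono by
#     # default, so channels is 1 unless mono=False is passed)
#     df["sample_rate"] = wav_sr
#     df["channels"] = 1 if wav_file.ndim == 1 else wav_file.shape[0]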
#     df["duration"] = df["end"] - df["start"]
#     df["start_idx"] = np.floor(df["start"] * wav_sr).astype(int)
#     df["end_idx"] = np.ceil(df["end"] * wav_sr).astype(int)
#     df["word_count"] = df["text"].apply(lambda x: len(x.split(" ")))

#     data = (
#         df.loc[(df["word_count"] >= 40) & (df["speaker_role"] == "scotus_justice")]
#         .copy()
#         .reset_index(drop=True)
#     )
#     data = data[
#         [
#             "file",
#             "line",
#             "speaker",
#             "start",
#             "end",
#             "duration",
#             "sample_rate",
#             "channels",
#             "start_idx",
#             "end_idx",
#             "word_count",
#             "text",
#         ]
#     ]

#     all_data.append(data)

# all_transcripts = pd.concat(all_data)
# print("\n Record Info:")
# print(all_transcripts.shape)

# print(f"{round(time() - st, 2)}s")


 Record Info:
(2545, 12)
18.79s
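The commented-out cell above finds each transcript by replacing `.wav` with `.json` in the file name. A minimal sketch of the same pairing with `Path.with_suffix` follows; it also skips recordings that have no transcript. The function name `pair_wavs_with_transcripts` and the temporary files are hypothetical and stand in for the real `wavs_dir` contents.

```python
import tempfile
from pathlib import Path


def pair_wavs_with_transcripts(root):
    """Map each .wav under root to its sibling .json, skipping wavs without one."""
    pairs = {}
    for wav in sorted(Path(root).rglob("*.wav")):
        sidecar = wav.with_suffix(".json")  # same folder, same stem
        if sidecar.exists():
            pairs[wav] = sidecar
    return pairs


# Demonstration on empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "12-574.wav").touch()
    (root / "12-574.json").touch()
    (root / "18-556.wav").touch()  # no transcript next to this one
    pairs = pair_wavs_with_transcripts(root)
    paired_names = [wav.name for wav in pairs]

print(paired_names)  # ['12-574.wav']
```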


In [21]:
a_sample = data.sample(n=5)
a_sample

| | file | line | start | end | speaker | speaker_role | word_count | duration | text | start_idx | end_idx |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1163 | 12-574 | 30 | 671.684 | 721.033 | Anthony_M_Kennedy | scotus_justice | 146 | 49.349 | But in -- in this case, it was known or should... | 10746944 | 11536528 |
| 14617 | 19-1414 | 75 | 1799.385 | 1848.125 | Amy_Coney_Barrett | scotus_justice | 136 | 48.74 | Mr. Feigin, I'd like to go back to your interc... | 28790160 | 29570000 |
| 13496 | 18-556 | 131 | 1618.28 | 1630.72 | Sonia_Sotomayor | scotus_justice | 41 | 12.44 | Well, if you drive by. Plenty of police office... | 25892480 | 26091520 |
| 17000 | 20-543 | 23 | 566.86 | 587.83 | Stephen_G_Breyer | scotus_justice | 73 | 20.97 | How do you do that? I mean, that's -- that's w... | 9069760 | 9405280 |
| 13579 | 18-587 | 335 | 4106.84 | 4131.44 | John_G_Roberts_Jr | scotus_justice | 83 | 24.6 | -- what if it were less, as you view, in categ... | 65709440 | 66103040 |


## Listen to Audio

In [22]:
sample_1 = dict(a_sample.iloc[0])
# Load at target_sr: start_idx/end_idx were computed at that rate, so the
# audio must use the same rate for the slice below to hit the right samples.
wav_file, wav_sr = librosa.load(
    path=osp.join(wavs_dir, f"{sample_1['file']}.wav"),
    sr=target_sr,
)

print("Speaker:", sample_1["speaker"])
print("File - Line", sample_1["file"], "-", sample_1["line"])
print("Duration:", sample_1["duration"])
print("Text:", sample_1["text"])
play_audio(wav_file[sample_1["start_idx"] : sample_1["end_idx"]], wav_sr)

Speaker: Anthony_M_Kennedy
File - Line 12-574 - 30
Duration: 49.349
Text: But in -- in this case, it was known or should have been known that these were gamblers, they were in Nevada. That's where a lot of -- that's where their gambling takes place. They were residents of Nevada. So in that sense, they were like the plaintiff in -- in Calder. The injury was there and the defendant arguably knew or should have known that that's where its major impact would be. I recognize your point that when you take money away, then you're inconvenienced in any State where you happen to be. But there was an argument here -- it seems to me there is an argument here that this was gambling and these people were from Nevada and so you've -- this -- this curtails their right or -- or their option to conduct -- to conduct their activities in -- in Nevada.
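One way to sanity-check a sliced segment before listening is to compare its length in seconds with the `duration` column of the transcript. The sketch below uses a synthetic signal rather than a real recording; `segment_matches_duration`, the tolerance `tol_s`, and the timestamps are hypothetical.

```python
import numpy as np


def segment_matches_duration(segment, sr, expected_s, tol_s=0.01):
    """True if the segment's length in seconds is within tol_s of expected_s."""
    actual_s = len(segment) / sr
    return abs(actual_s - expected_s) <= tol_s


sr = 16000
signal = np.zeros(60 * sr, dtype=np.float32)  # 60 s of silence as a stand-in
start_s, end_s = 10.25, 22.69                 # hypothetical utterance bounds

# Slice with indices computed at 16 kHz
segment = signal[int(np.floor(start_s * sr)) : int(np.ceil(end_s * sr))]

ok_same_rate = segment_matches_duration(segment, sr, end_s - start_s)
ok_wrong_rate = segment_matches_duration(segment, 44100, end_s - start_s)

print(ok_same_rate)   # True: indices and playback rate agree
print(ok_wrong_rate)  # False: 16 kHz indices read as 44.1 kHz audio
```

A failed check suggests that the indices and the loaded audio use different sample rates.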


In [24]:
sample_1 = dict(a_sample.iloc[1])
# Load at target_sr so the audio matches the rate used for start_idx/end_idx.
wav_file, wav_sr = librosa.load(
    path=osp.join(wavs_dir, f"{sample_1['file']}.wav"),
    sr=target_sr,
)

print("Speaker:", sample_1["speaker"])
print("File - Line", sample_1["file"], "-", sample_1["line"])
print("Duration:", sample_1["duration"])
print("Text:", sample_1["text"])
play_audio(wav_file[sample_1["start_idx"] : sample_1["end_idx"]], wav_sr)

Speaker: Amy_Coney_Barrett
File - Line 19-1414 - 75
Duration: 48.74
Text: Mr. Feigin, I'd like to go back to your interchange with Justice Gorsuch. You said that the authority -- the investigative authority doesn't extend past Terry stops into arrests because arrests mark the beginning of the adjudicatory process. We -- I -- I didn't quite follow whether you were saying to Justice Gorsuch that the reason why tribes lack authority to arrest is because they are implicitly divested of that authority under the Constitution, so even under the Colville rationale or whether it's the cross-deputization statutes or whether it's our prior cases making clear that tribes lack the authority to finally adjudicate the rights, criminally or civilly, of non-members. So could you just explain to me what it is that takes away that authority, or is it that they never possessed it in the first place?


In [25]:
sample_1 = dict(a_sample.iloc[2])
# Load at target_sr so the audio matches the rate used for start_idx/end_idx.
wav_file, wav_sr = librosa.load(
    path=osp.join(wavs_dir, f"{sample_1['file']}.wav"),
    sr=target_sr,
)

print("Speaker:", sample_1["speaker"])
print("File - Line", sample_1["file"], "-", sample_1["line"])
print("Duration:", sample_1["duration"])
print("Text:", sample_1["text"])
play_audio(wav_file[sample_1["start_idx"] : sample_1["end_idx"]], wav_sr)

Speaker: Sonia_Sotomayor
File - Line 18-556 - 131
Duration: 12.44
Text: Well, if you drive by. Plenty of police officers that let someone they want to stop move forward from where they are and then pull in behind them. There's a whole lot of things that could be done to do that.


In [26]:
sample_1 = dict(a_sample.iloc[3])
# Load at target_sr so the audio matches the rate used for start_idx/end_idx.
wav_file, wav_sr = librosa.load(
    path=osp.join(wavs_dir, f"{sample_1['file']}.wav"),
    sr=target_sr,
)

print("Speaker:", sample_1["speaker"])
print("File - Line", sample_1["file"], "-", sample_1["line"])
print("Duration:", sample_1["duration"])
print("Text:", sample_1["text"])
play_audio(wav_file[sample_1["start_idx"] : sample_1["end_idx"]], wav_sr)

Speaker: Stephen_G_Breyer
File - Line 20-543 - 23
Duration: 20.97
Text: How do you do that? I mean, that's -- that's what I can't quite figure out, because there's an argument, you know, that even if the ISDA applies, the CARES Act doesn't apply. But I don't see -- once you say the ISDA -- once that definition applies, and it's a statute that really doesn't make sense to put this kind of corporation in it, how do you read them out of it?


In [27]:
sample_1 = dict(a_sample.iloc[4])
# Load at target_sr so the audio matches the rate used for start_idx/end_idx.
wav_file, wav_sr = librosa.load(
    path=osp.join(wavs_dir, f"{sample_1['file']}.wav"),
    sr=target_sr,
)

print("Speaker:", sample_1["speaker"])
print("File - Line", sample_1["file"], "-", sample_1["line"])
print("Duration:", sample_1["duration"])
print("Text:", sample_1["text"])
play_audio(wav_file[sample_1["start_idx"] : sample_1["end_idx"]], wav_sr)

Speaker: John_G_Roberts_Jr
File - Line 18-587 - 335
Duration: 24.6
Text: -- what if it were less, as you view, in categorical terms? What if the Attorney General said, I've looked at this, it's -- it's -- it's a close case, but, on balance, I don't think we have the authority? Or if he said, I'm pretty sure we don't have the authority, but a court might come out differently? Does your analysis change, or is it only when he says this is -- as far as I'm concerned, this is definite; it's illegal?
