<a href="https://colab.research.google.com/github/ved-phadke/math-m148-final-project/blob/main/transcript_dataset_construction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transcripts

Aim: build a dataset of all stimuli with each transcript stored in a single column, so the text is easy to process later.

## Imports


In [None]:
!pip install youtube_transcript_api
!pip install youtube_search
!pip install lyricsgenius


Collecting youtube_search
  Downloading youtube_search-2.1.2-py3-none-any.whl.metadata (1.2 kB)
Downloading youtube_search-2.1.2-py3-none-any.whl (3.4 kB)
Installing collected packages: youtube_search
Successfully installed youtube_search-2.1.2


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from youtube_transcript_api import YouTubeTranscriptApi
import re
from youtube_search import YoutubeSearch
import lyricsgenius

## Read in Data

In [None]:
stim_df = pd.read_csv('/content/Stimuli.csv')
stim_df.head()

Unnamed: 0,Stimulus name,Description,URL
0,Agnus Dei (Audio),The Flemish Radio Choir performs Samuel Barber...,https://youtu.be/bFnbGevBnvY
1,"Misere Mei, Deus (Audio)",Tenebrae Choir performs Gregorio Allegri’s Mis...,https://youtu.be/3nakMFiPB0w
2,3rd Grade Dropout Speech (Audio),"Rick Rigsby is an ordained minister, motivatio...",https://youtu.be/Yu23MU4vsIM
3,Unbroken - Motivation (Audio),"This motivational compilation, from a series b...",https://youtu.be/QRE2CUZxtQY
4,Laughing Heart (Audio),The Laughing Heart is a classic poem by Charle...,https://youtu.be/9COXybhp8p8


## Function Declarations

In [None]:
def extract_video_id(url):
  """
  Extracts the video ID from a YouTube URL.

  Args:
    url: The YouTube URL.

  Returns:
    The video ID, or None if the URL is invalid.
  """
  match = re.search(r"(?:v=|\/)([0-9A-Za-z_-]{11})", url)
  if match:
    return match.group(1)
  return None

# Apply the function to the 'URL' column
stim_df['video_id'] = stim_df['URL'].apply(extract_video_id)
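
The regex above accepts any 11-character path segment, so a URL from another site can produce a false ID. One possible stricter variant checks the host first. This is only a sketch: `extract_video_id_strict` and `VIDEO_ID_RE` are hypothetical names that are not part of the original notebook, and only the `youtu.be/<id>` and `youtube.com/watch?v=<id>` URL forms are handled.

```python
import re
from urllib.parse import parse_qs, urlparse

# Same 11-character alphabet that the notebook's regex uses.
VIDEO_ID_RE = re.compile(r"[0-9A-Za-z_-]{11}")

def extract_video_id_strict(url):
    """Hypothetical stricter variant of extract_video_id (an assumption, not the notebook's method)."""
    if not isinstance(url, str):
        return None  # e.g. NaN from an empty CSV cell
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch links carry the ID in the "v" query parameter.
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return candidate if VIDEO_ID_RE.fullmatch(candidate) else None

print(extract_video_id_strict("https://youtu.be/bFnbGevBnvY"))
print(extract_video_id_strict("https://www.youtube.com/watch?v=bFnbGevBnvY"))
print(extract_video_id_strict("https://example.com/abcdefghijk"))
```

For the stimuli shown here, which all use `youtu.be` links, both versions should agree. The stricter check only matters if other hosts appear in the `URL` column.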


In [None]:
def get_transcript(video_id):
    """
    Fetches the transcript for a given YouTube video ID using the YouTubeTranscriptApi.

    Parameters:
    video_id (str): The unique identifier for the YouTube video.

    Returns:
    list[dict] | None: A list of dictionaries containing the transcript text and timestamps,
                        or None if an error occurs.
    """
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return transcript
    except Exception as e:
        print(f"Error getting transcript for video ID {video_id}: {e}")
        return None

# Apply the function to extract transcripts for all video IDs in the DataFrame
stim_df['transcript'] = stim_df['video_id'].apply(get_transcript)


Error getting transcript for video ID bFnbGevBnvY: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=bFnbGevBnvY! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!
Error getting transcript for video ID 3nakMFiPB0w: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=3nakMFiPB0w! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.c
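
The exception messages printed above are long and repeat the same boilerplate for every video. A minimal sketch of a quieter wrapper is shown below. It keeps only the first non-empty line of the message and returns it next to the result instead of printing it. `short_error`, `fetch_quietly`, and `fake_fetch` are hypothetical names used for illustration, and `fake_fetch` stands in for the real API call so the example runs without network access.

```python
def short_error(exc):
    """Return the first non-empty line of an exception message, or the exception class name."""
    for line in str(exc).splitlines():
        if line.strip():
            return line.strip()
    return type(exc).__name__

def fetch_quietly(fetch, key):
    """Call fetch(key); return (result, None) on success or (None, short message) on failure."""
    try:
        return fetch(key), None
    except Exception as exc:
        return None, short_error(exc)

# Stand-in fetcher for illustration only; it always fails with a multi-line message.
def fake_fetch(video_id):
    raise RuntimeError(
        "\nCould not retrieve a transcript for the video!\n\n"
        "Subtitles are disabled for this video"
    )

result, error = fetch_quietly(fake_fetch, "bFnbGevBnvY")
print(result, "|", error)
```

Collecting the short messages in a separate column, rather than printing them, would also make it easy to count how many stimuli failed and why.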

In [None]:
def search_and_get_transcript(stimulus_name):
    """
    Searches for a YouTube video based on a given stimulus name and retrieves its transcript.

    Parameters:
    stimulus_name (str): The name of the stimulus to search for on YouTube.

    Returns:
    str | None: The transcript as a single concatenated string if found, otherwise None.
    """
    try:
        # Search YouTube for the stimulus name, retrieving the top result
        results = YoutubeSearch(stimulus_name, max_results=1).to_dict()

        if results:
            video_id = results[0]['id']
            transcript = YouTubeTranscriptApi.get_transcript(video_id)

            # Join the transcript segments into a single string
            transcript_text = ' '.join([segment['text'] for segment in transcript])
            return transcript_text
        else:
            print(f"No search results found for: {stimulus_name}")
            return None
    except Exception as e:
        print(f"Error searching or getting transcript for {stimulus_name}: {e}")
        return None

# Iterate through the DataFrame and fill missing transcripts
for index, row in stim_df.iterrows():
    if row['transcript'] is None or not row['transcript']:  # Check for empty or None transcripts
        transcript = search_and_get_transcript(row['Stimulus name'])
        if transcript:
            stim_df.loc[index, 'transcript'] = transcript  # Store transcript as a single string


Error searching or getting transcript for Agnus Dei (Audio): 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=K6yeaHNXsBs! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!
Error searching or getting transcript for Misere Mei, Deus (Audio): 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=rs5bc_P1kKo! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create a
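
The notebook fills the `transcript` column in successive passes: first by video ID, then by search, then by lyrics. The same idea can be written as a single fallback chain, sketched below with plain Python callables. `first_non_empty` is a hypothetical helper, and the lambdas are stand-ins for the real lookups.

```python
def first_non_empty(*getters):
    """Call zero-argument getters in order; return the first truthy result, else None."""
    for getter in getters:
        try:
            value = getter()
        except Exception:
            continue  # treat a failing source like a missing one and move on
        if value:
            return value
    return None

def failing_source():
    # Stand-in for a lookup that raises, e.g. because subtitles are disabled.
    raise RuntimeError("Subtitles are disabled for this video")

transcript = first_non_empty(
    failing_source,               # first choice fails
    lambda: [],                   # second choice returns nothing
    lambda: "fallback transcript text",
)
print(transcript)
```

One design consequence is that each row is resolved in one place, so the per-row source that succeeded could also be recorded if needed.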

In [None]:
# Initialize the Genius API client. The access token is read from an environment
# variable (GENIUS_ACCESS_TOKEN is an assumed name) instead of being hard-coded,
# so the secret is not committed with the notebook.
import os
genius = lyricsgenius.Genius(os.environ["GENIUS_ACCESS_TOKEN"])

def get_lyrics(song_title, artist_name):
    """
    Fetches song lyrics from the Genius API based on song title and artist name.

    Parameters:
    song_title (str): The title of the song.
    artist_name (str): The name of the artist.

    Returns:
    str | None: The lyrics of the song as a single string if found, otherwise None.
    """
    try:
        song = genius.search_song(song_title, artist_name)
        if song:
            return song.lyrics
        else:
            print(f"Lyrics not found for {song_title} by {artist_name}")
            return None
    except Exception as e:
        print(f"Error getting lyrics for {song_title} by {artist_name}: {e}")
        return None

# Iterate through the DataFrame and fill missing transcripts with lyrics
for index, row in stim_df.iterrows():
    if row['transcript'] is None or not row['transcript']:  # Check for empty or missing transcripts
        parts = row['Stimulus name'].split(" - ")  # Assume format: "Artist - Song Title"

        if len(parts) == 2:
            song_title = parts[1].strip()
            artist_name = parts[0].strip()
            lyrics = get_lyrics(song_title, artist_name)
            if lyrics:
                stim_df.loc[index, 'transcript'] = lyrics  # Store lyrics as transcript
        else:
            print(f"Could not parse stimulus name: {row['Stimulus name']}")
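# NOTE: `rocky` is not defined anywhere in this notebook. It presumably holds a manually
# prepared transcript string for row 37 and must be defined before this line runs.
stim_df.at[37, 'transcript'] = rocky

ERROR: Operation cancelled by user
Could not parse stimulus name: Agnus Dei (Audio)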
Could not parse stimulus name: Misere Mei, Deus (Audio)
Could not parse stimulus name: Clair de Lune (Audio)
Could not parse stimulus name: Aramaic Choir (Audio)
Could not parse stimulus name: Air France 
Searching for "Time" by Hans Zimmer...
Error getting lyrics for Time by Hans Zimmer: [Errno 403] 403 Client Error: Forbidden for url: https://genius.com/api/search/multi?q=Time+Hans+Zimmer
Could not parse stimulus name: Aramaic Choir


Unnamed: 0,Stimulus name,Description,URL,video_id,transcript
0,Agnus Dei (Audio),The Flemish Radio Choir performs Samuel Barber...,https://youtu.be/bFnbGevBnvY,bFnbGevBnvY,
1,"Misere Mei, Deus (Audio)",Tenebrae Choir performs Gregorio Allegri’s Mis...,https://youtu.be/3nakMFiPB0w,3nakMFiPB0w,
2,3rd Grade Dropout Speech (Audio),"Rick Rigsby is an ordained minister, motivatio...",https://youtu.be/Yu23MU4vsIM,Yu23MU4vsIM,[{'text': 'the wisest person I ever met in my ...
3,Unbroken - Motivation (Audio),"This motivational compilation, from a series b...",https://youtu.be/QRE2CUZxtQY,QRE2CUZxtQY,[{'text': 'you can't connect the dots looking'...
4,Laughing Heart (Audio),The Laughing Heart is a classic poem by Charle...,https://youtu.be/9COXybhp8p8,9COXybhp8p8,"[{'text': 'thank you', 'start': 1.68, 'duratio..."
5,Hallelujah Choir (Audio),Choir! Choir! Choir! began as a weekly drop-in...,https://youtu.be/gCrUi_tRN8g,gCrUi_tRN8g,\n[Verse 1]\nNow I've heard there was a secret...
6,Jason Silva - Existential Bummer (Audio),Storyteller Jason Silva considers the imperman...,https://youtu.be/Lz-P3WdIHvw,Lz-P3WdIHvw,"[{'text': 'foreign', 'start': 0.06, 'duration'..."
7,Clair de Lune (Audio),"Claude Debussy’s “Clair de Lune,” (Suite berga...",https://youtu.be/JRinyHJ_9-E,JRinyHJ_9-E,
8,Carl Sagan Pale Blue Dot (Audio),"On Feb. 14, 1990, astronomer Carl Sagan gave a...",https://youtu.be/T2Qv_Vms-Yw,T2Qv_Vms-Yw,"\nFrom this distant vantage point, the Earth m..."
9,Motorcycle Diaries (Audio),De Ushuaia a La Quiaca (From Ushuaia to La Qui...,https://youtu.be/D95hQkiRNrQ,D95hQkiRNrQ,the Motorcycle Diaries notes on a Latin Americ...
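
Many stimulus names end in ` (Audio)`, so splitting on `" - "` alone leaves that suffix attached to the song title that is sent to Genius. A possible pre-processing step is sketched below. `parse_stimulus_name` and `AUDIO_SUFFIX_RE` are hypothetical names, and the "Artist - Song Title" layout is the same assumption the notebook makes. Some names, such as "Unbroken - Motivation (Audio)", match the pattern without being songs.

```python
import re

# Matches a trailing "(Audio)" tag plus any surrounding whitespace.
AUDIO_SUFFIX_RE = re.compile(r"\s*\(Audio\)\s*$")

def parse_stimulus_name(name):
    """Return (artist, title) for names shaped like 'Artist - Title', else None."""
    cleaned = AUDIO_SUFFIX_RE.sub("", name).strip()
    parts = cleaned.split(" - ")
    if len(parts) != 2:
        return None
    artist, title = (part.strip() for part in parts)
    return artist, title

print(parse_stimulus_name("Jason Silva - Existential Bummer (Audio)"))
print(parse_stimulus_name("Agnus Dei (Audio)"))
```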


In [None]:
def process_transcripts(df):
    """
    Processes the 'transcript' column in a DataFrame by converting transcript lists into
    single concatenated text strings.

    Parameters:
    df (pd.DataFrame): DataFrame containing a 'transcript' column.

    Returns:
    pd.DataFrame: The modified DataFrame with transcripts formatted as single strings.
    """
    def join_transcript_texts(transcript_list):
        """
        Joins a list of transcript segments into a single text string.

        Parameters:
        transcript_list (list | str): A list of dictionaries containing transcript text,
                                      or a pre-existing transcript string.

        Returns:
        str: The joined transcript text if the input is a list, otherwise returns the input as is.
        """
        if isinstance(transcript_list, list):
            joined_text = ' '.join([item['text'] for item in transcript_list if isinstance(item, dict) and 'text' in item])
            return joined_text
        else:
            return transcript_list  # Return as is if not a list

    # Apply transformation to the 'transcript' column
    df['transcript'] = df['transcript'].apply(join_transcript_texts)

    return df

# Process transcripts in the DataFrame
stim_df = process_transcripts(stim_df)

# Flag stimuli that still have no transcript (True), i.e. items that are music only.
# astype(bool) maps None and empty strings to False, so the result is negated.
stim_df['only_music'] = ~stim_df['transcript'].astype(bool)

# Save the processed DataFrame to a CSV file
stim_df.to_csv('stim_df.csv', index=False)
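
The lyrics rows in the table above begin with section markers such as `[Verse 1]` and contain raw newlines, which may get in the way of later text processing. A minimal cleanup sketch follows. `normalize_transcript` and `SECTION_MARKER_RE` are hypothetical names, and removing every bracketed marker is an assumption about what the downstream analysis wants.

```python
import re

# Matches single-line bracketed markers such as "[Verse 1]" or "[Chorus]".
SECTION_MARKER_RE = re.compile(r"\[[^\]\n]*\]")

def normalize_transcript(text):
    """Strip bracketed section markers and collapse whitespace; pass non-strings through."""
    if not isinstance(text, str):
        return text  # e.g. None for stimuli without a transcript
    without_markers = SECTION_MARKER_RE.sub(" ", text)
    return " ".join(without_markers.split())

sample = "\n[Verse 1]\nNow I've heard there was a secret"
print(normalize_transcript(sample))
```

If adopted, it could be applied to the `transcript` column with `.apply(normalize_transcript)` before the CSV is written.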
