# Data Gathering Code
This notebook was used to gather the transcripts via the YouTube Data API.
It has been cleaned up since its original run, which obtained the transcripts on 12/4/2023. The original, messier version is archived. Proper code comments may be added later.

In [None]:
# Imports
import requests
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound
import os

In [6]:
with open('./../../API_keys/youtube.txt') as key_file:
    youtube_api_key = key_file.read().strip()  # strip a trailing newline so the key is sent cleanly

In [9]:
def get_video_ids(api_key, channel_id):
    """Return the IDs of all videos found for a channel, following result pagination."""
    url = "https://www.googleapis.com/youtube/v3/search"
    video_ids = []
    next_page_token = None

    while True:
        params = {
            'part': 'id',
            'channelId': channel_id,
            'maxResults': 50,
            'pageToken': next_page_token,
            'type': 'video',
            'key': api_key
        }

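        response = requests.get(url, params=params)
        response.raise_for_status()  # fail loudly on HTTP errors such as an exhausted quota
        response = response.json()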

        video_ids += [item['id']['videoId'] for item in response.get('items', [])]

        next_page_token = response.get('nextPageToken')
        if not next_page_token:
            break

    return video_ids


In [25]:
def get_video_title(video_id, api_key):
    """Return the title of a video, looked up via the videos endpoint."""
    url = "https://www.googleapis.com/youtube/v3/videos"
    params = {'id': video_id, 'key': api_key, 'part': 'snippet'}
    response = requests.get(url, params=params).json()
    title = response['items'][0]['snippet']['title']
    return title

In [20]:
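def get_transcript(video_id):
    """Return the full transcript of a video as a single string."""
    try:
        transcript_list = YouTubeTranscriptApi.get_transcript(video_id)
        return " ".join([item['text'] for item in transcript_list])
    except NoTranscriptFound:
        # Any other error raised by the library is left for the caller to handle
        return "No transcript found"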

# Example Usage
# video_ids = get_video_ids(your_api_key, channel_id)
# for video_id in video_ids:
#     print(get_transcript(video_id))


In [15]:
list_video_IDs = get_video_ids(youtube_api_key, "UCv4VkfbX8YfqodF-4coEEfQ")
print(len(list_video_IDs))
list_video_IDs[0:5]

82


['W84ws9AazSc', 'VNvH3a6Aenw', '4wi49P-Qjcc', 'XoUR_PQIdRg', '0Vijus_c-aY']

In [23]:
dict_transcripts_two = {}

for str_video_id in list_video_IDs:
    try:
        dict_transcripts_two[str_video_id] = get_transcript(str_video_id)
    except Exception:
        print("transcript for video: '" + str_video_id + "' was unavailable")

# dict_transcripts_two now contains a dictionary where keys are video IDs and values are transcripts


transcript for video: 'H65WG2s4pzY' was unavailable


When we look up [the video that did not have a transcript](https://www.youtube.com/watch?v=H65WG2s4pzY)
we see that it is just a 2:46 review of Days of Future Past whose subtitles are unavailable for whatever reason.
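
The loop above only prints the ID that failed. As a minimal sketch, the skipped IDs could also be recovered afterwards by comparing the full ID list with the keys that were actually stored. The helper name `find_missing_ids` and the sample dictionary are hypothetical and not part of the original run.

```python
def find_missing_ids(all_ids, collected):
    """Return the IDs from all_ids that are not keys of collected, in original order."""
    return [video_id for video_id in all_ids if video_id not in collected]


# Toy data: two IDs from the listing above plus the one that failed.
# The transcript values are placeholders, not real transcripts.
sample_ids = ['W84ws9AazSc', 'H65WG2s4pzY', 'VNvH3a6Aenw']
sample_collected = {
    'W84ws9AazSc': 'placeholder transcript',
    'VNvH3a6Aenw': 'placeholder transcript',
}

missing = find_missing_ids(sample_ids, sample_collected)
print(missing)
```

In the notebook itself this would be called as `find_missing_ids(list_video_IDs, dict_transcripts_two)`.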

In [34]:
dict_transcripts_two = {}

for str_video_id in list_video_IDs:
    try:
        dict_transcripts_two[str_video_id] = {'title': get_video_title(str_video_id, youtube_api_key),
                                              'transcript': get_transcript(str_video_id)}
    except Exception:
        print("Info for video: '" + str_video_id + "' was unavailable")

# dict_transcripts_two now maps each video ID to a dict holding its 'title' and 'transcript'


Info for video: 'H65WG2s4pzY' was unavailable


In [35]:
print(len(dict_transcripts_two))
#dict_transcripts_two

81


In [36]:
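# dict_transcripts is not defined anywhere in this cleaned-up notebook; it is presumably left over from the archived run
print(len(dict_transcripts))
#dict_transcripts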

71


In [42]:
for video_id, details in dict_transcripts_two.items():
    title = details['title']
    transcript = details['transcript']
    
    # Keep only letters, digits, and spaces, since other characters may not be allowed in file names
    filename = "".join([c for c in title if c.isalpha() or c.isdigit() or c == ' ']).rstrip()

    # Limit filename length to avoid errors on some file systems
    filename = filename[:100]

    # Writing to file
    with open(f"./data/{filename}.txt", "w", encoding="utf-8") as file:
        file.write(transcript)

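
One thing the cell above does not guard against is two titles that reduce to the same filename once punctuation is stripped, in which case the later file would overwrite the earlier one. The following is a minimal sketch of one way to handle that, assuming it is acceptable to append the video ID on a collision. `make_filename`, the `used_names` set, the second title, and the IDs `id_one` and `id_two` are all hypothetical.

```python
def make_filename(title, video_id, used_names, max_len=100):
    """Build a filesystem-friendly name from a title, adding the video ID on collisions."""
    # Same filter as the cell above: keep letters, digits, and spaces only
    name = "".join(c for c in title if c.isalpha() or c.isdigit() or c == ' ').rstrip()
    name = name[:max_len].rstrip()
    # Append the video ID if nothing usable is left or the name is already taken
    if not name or name in used_names:
        name = f"{name} {video_id}".strip()
    used_names.add(name)
    return name


used = set()
first = make_filename('Religion and Anime!', 'id_one', used)
second = make_filename('Religion and Anime?', 'id_two', used)  # made-up colliding title
print(first)
print(second)
```

The same `used` set would need to be shared across the whole loop so that every filename is checked against all earlier ones.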

In [55]:
int_words_total = 0
for str_key_id in dict_transcripts_two.keys():
    # Count whitespace-separated words and store the count alongside the transcript
    dict_transcripts_two[str_key_id]["word_count"] = len(dict_transcripts_two[str_key_id]["transcript"].split())
    count = dict_transcripts_two[str_key_id]["word_count"]
    if count < 5000:
        print(count, dict_transcripts_two[str_key_id]["title"])
    int_words_total += count

79 Neil Gaiman: A straight author with amazing queer characters?
136 The Real Burden of Being Rich
4414 The Troubling Thirst for Jeffrey Dahmer
3679 The Traumatic Camp of "Mommie Dearest"
3462 The Secret Crimes of a Dying Franchise
3956 The Gay Horror Manga You Should Be Reading
4639 When Hollywood Came Out of the Closet
4615 America v. Homosexuality
4451 Where The "Bury Your Gays" Trope Came From
4294 How a Gay Show Changed TV... and Was Forgotten
4683 Hollywood's Golden Age (of Queer Coding)
4937 How Hollywood was Born Gay
10 Coming This Fall
35 Fistory!
89 The Magic Realism of Revolutionary Girl Utena
185 Religion and Anime!
127 The Gay Horror Manga You should Be Reading - The Summer Hikaru Died #horrorstories #manga
3421 Heartstopper and Queer Optimism
2510 Harry Potter and The Closet Under The Stairs - Queer themes in Harry Potter (Video essay)
4445 The Queer Joy of Everything Everywhere All At Once
889 Geek Movie Review! Captain America: The Winter Solider
150 The Barbie to Evang

In [60]:
int_sentence_total = 0
for str_key_id in dict_transcripts_two.keys():
    # Approximate the sentence count by splitting the transcript on periods
    dict_transcripts_two[str_key_id]["sentence_count"] = len(dict_transcripts_two[str_key_id]["transcript"].split("."))
    count = dict_transcripts_two[str_key_id]["sentence_count"]
    if count > 10:
        print(count, dict_transcripts_two[str_key_id]["title"])
    int_sentence_total += count

405 The Brilliance of Our Flag Means Death
201 The Secret Crimes of a Dying Franchise
221 The Gay Horror Manga You Should Be Reading
531 The Tragedy of Being Rich | James Somerton
327 The Dangers of Blissful Ignorance
345 The Real Hogwarts Legacy
231 How Hollywood was Born Gay
497 The Sadism of Class
382 For The Love of Gay Nuance
290 Disney's War Against Gay kids | James Somerton
590 SHIPPING - The Good, The Bad, and the Thirsty
491 How Wanda Became An Accidental Gay Icon
500 The Gay Appeal of Toxic Love
390 Hollywood's (Gay) China Problem | James Somerton
318 The Queer Dystopia of the LGB Movement
551 An Over-Emotional Look at Why JK Rowling is Bad
330 Disney's Gay Cultural Appropriation | James Somerton
12 Why Bad Gays Are Good
480 The Necessity of Gay Crime | James Somerton
183 Heartstopper and Queer Optimism
371 The Diversity of "The Rings of Power"
233 Disney's Silence on Gay Youth
27 Making It Big: The History of Gay Adult Film (Documentary)


In [48]:
int_words_total

509802

In [58]:
int_sentence_total

8110
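
The sentence totals above come from `split(".")`, which also counts the empty string left after a final period and ignores `?` and `!`. Below is a minimal sketch of a slightly stricter counter, assuming punctuation-based splitting is good enough for this data. `count_sentences` is a hypothetical helper and was not used to produce the numbers above. Captions that contain no punctuation at all would still be counted as a single sentence, so any count of this kind may be low for such transcripts.

```python
import re


def count_sentences(text):
    """Count non-empty pieces of text separated by '.', '!' or '?'."""
    pieces = re.split(r"[.!?]+", text)
    return sum(1 for piece in pieces if piece.strip())


sample = "One. Two. Three."
naive_count = len(sample.split("."))      # counts the trailing empty string too
stricter_count = count_sentences(sample)  # ignores empty pieces
print(naive_count, stricter_count)
```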