In [1]:
import os
import re

In [2]:
def load_transcripts(data_dir):
    transcripts = []
    for file_name in os.listdir(data_dir):
        if file_name.endswith(".txt"):
            with open(os.path.join(data_dir, file_name), 'r', encoding='utf-8') as f:
                transcripts.append(f.read())
    return transcripts
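
Note: os.listdir returns entries in arbitrary order, so an index such as transcript_list[0] may point at a different file on another machine. A minimal sketch of an order-stable variant follows. The helper name load_transcripts_sorted and the (file_name, text) return shape are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def load_transcripts_sorted(data_dir):
    """Return (file_name, text) pairs for every .txt file, sorted by name.

    Hypothetical variant of load_transcripts: the name and the tuple
    return shape are assumptions made for this sketch.
    """
    pairs = []
    # sorted() makes the index-to-file mapping reproducible across runs
    for path in sorted(Path(data_dir).glob("*.txt")):
        pairs.append((path.name, path.read_text(encoding="utf-8")))
    return pairs
```

Keeping the file name next to the text makes it possible to tell which transcript a later "test passed for index N" note refers to.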

In [3]:
def parse_transcript(transcript):
    title_match = re.search(r"Title: (.+)", transcript)
    url_match = re.search(r"URL Source: (.+)", transcript)
    content_match = re.search(r"Markdown Content:(.+)", transcript, re.DOTALL)

    return {
        "title": title_match.group(1) if title_match else None,
        "url": url_match.group(1) if url_match else None,
        "content": content_match.group(1).strip() if content_match else None
    }
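
The sample transcript header also carries a "Published Time:" line that parse_transcript does not read. A minimal sketch of how it could be extracted, assuming the value is always an ISO 8601 timestamp as in the sample. The helper name parse_published_time is hypothetical.

```python
import re
from datetime import datetime


def parse_published_time(transcript):
    """Return the 'Published Time' header as a datetime, or None if absent.

    Assumes the header holds an ISO 8601 timestamp, as in the sample
    transcript; the helper name is hypothetical.
    """
    match = re.search(r"^Published Time: (.+)$", transcript, re.MULTILINE)
    if not match:
        return None
    try:
        return datetime.fromisoformat(match.group(1).strip())
    except ValueError:
        # Header present but not in the assumed format
        return None
```

The result could be stored next to "title" and "url" in the parsed dictionary if chunk metadata should carry the publication date.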

In [4]:
# test the parse_transcript function

transcript = """
Title: Transcript for Ben Shapiro vs Destiny Debate: Politics, Jan 6, Israel, Ukraine & Wokeism | Lex Fridman Podcast #410 - Lex Fridman

URL Source: https://lexfridman.com/ben-shapiro-destiny-debate-transcript/

Published Time: 2024-01-22T22:18:28+00:00

Markdown Content:
Introduction
------------

Destiny [(00:00:00)](https://youtube.com/watch?v=tYrdMjVXyNg&t=0) Something has to happen with Iran. There has to be some diplomatic bilateral communication there.

Ben Shapiro [(00:00:04)](https://youtube.com/watch?v=tYrdMjVXyNg&t=4) No. What has to happen is the containment of Iran.

Destiny [(00:00:06)](https://youtube.com/watch?v=tYrdMjVXyNg&t=6) History moves in one direction.

Ben Shapiro [(00:00:10)](https://youtube.com/watch?v=tYrdMjVXyNg&t=10) Communism, Nazism, all of that was a regression from what was happening at, for example, the beginning of the 19th century into the 20th century.

Ben Shapiro [(00:00:17)](https://youtube.com/watch?v=tYrdMjVXyNg&t=17) Do you think that today Donald Trump knows that he lost the election?

Destiny [(00:00:22)](https://youtube.com/watch?v=tYrdMjVXyNg&t=22) This is one of the areas where we get into this, I don’t understand if there’s brain-breaking happening or what’s going on. I don’t know what world we can ever live in where we say that Trump is less divisive for the country than Biden.

Ben Shapiro [(00:00:33)](https://youtube.com/watch?v=tYrdMjVXyNg&t=33) Joe Biden literally used the Occupational Safety and Hazard Administration to try to cram down vax mandates on 80 million Americans. That’s insane.

Destiny [(00:00:41)](https://youtube.com/watch?v=tYrdMjVXyNg&t=41) What about supercalifragilisticexpialidocious?

Ben Shapiro [(00:00:43)](https://youtube.com/watch?v=tYrdMjVXyNg&t=43) What about pneumonoultramicroscopicsilicovolcanoconiosis?

Destiny [(00:00:45)](https://youtube.com/watch?v=tYrdMjVXyNg&t=45) Yeah, or the science terms.

Destiny [(00:00:46)](https://youtube.com/watch?v=tYrdMjVXyNg&t=46) Or what about the 7,000 letter thing that’s from part of a biochem.

Lex Fridman [(00:00:49)](https://youtube.com/watch?v=tYrdMjVXyNg&t=49) I got my education in the Soviet Union. So we just did math. We didn’t run any of this.

Ben Shapiro [(00:00:53)](https://youtube.com/watch?v=tYrdMjVXyNg&t=53) That’s why you’re a useful person.

Lex Fridman [(00:00:54)](https://youtube.com/watch?v=tYrdMjVXyNg&t=54) Does body count matter? The following is a debate between Ben Shapiro and Destiny. Each arguably representing the right and left of American politics respectively. They are two of the most influential and skilled political debaters in the world. This debate has been a long time coming for many years. It’s about 2.5 hours and we could have easily gone for many more. And I’m sure we will. It is only round one. This is the Lex Fridman Podcast to support it. Please check out our sponsors in the description. And now, dear friends, here’s Ben Shapiro and Destiny.

Liberalism vs Conservatism
--------------------------

[(00:01:36)](https://youtube.com/watch?v=tYrdMjVXyNg&t=96) Ben, you’re conservative. Destiny, you’re a liberal. Can you each describe what key values underpin your philosophy on politics and maybe life in the context of this left to right political spectrum? You want to go first?

Destiny [(00:01:50)](https://youtube.com/watch?v=tYrdMjVXyNg&t=110) Yeah. So I think that we have a huge country full of a lot of people, a lot of individual talents, capabilities, and I think that the goal of government, broadly speaking, should be to try to ensure that everybody is able to achieve as much as possible. So on a liberal level, that usually means some people might need a little bit of a boost when it comes to things like education. They might need a little bit of a boost when it comes to providing certain necessities like housing or food or clothing. But broadly speaking, I mean, I’m still a liberal, not a communist or a socialist. I don’t believe in the total command economy, total communist takeover of all of the economy, but I think that broadly speaking, the government should kick in and help people when they need it.

Lex Fridman [(00:02:32)](https://youtube.com/watch?v=tYrdMjVXyNg&t=152) And that government can and should be big?

Destiny [(00:02:34)](https://youtube.com/watch?v=tYrdMjVXyNg&t=154) Not necessarily. I noticed that when liberals talk about government, especially taxes, it seems like they talk about it for taxes sake or bigness sake. So people talk about taxes sometimes as like a punishment, like tax the rich. I think taxing the rich is fine insofar as it funds the programs that we want to fund. But Democrats have a really big problem demonizing success or wealth. And I don’t think that’s a bad thing. I don’t think it’s a bad thing to be wealthy, to be a billionaire or whatever, as long as we’re funding what we need to fund.

Lex Fridman [(00:03:03)](https://youtube.com/watch?v=tYrdMjVXyNg&t=183) Ben, what do you think it means to be a conservative? What’s the philosophy that underlies your political view?

Ben Shapiro [(00:03:07)](https://youtube.com/watch?v=tYrdMjVXyNg&t=187) So first of all, I’m glad that Destiny, you’re already coming out as a Republican. That’s exciting. I mean, we hold a lot in common in terms of the basic idea that people ought to have as much opportunity as possible and also insofar as the government should do the minimum amount necessary to interfere in people’s lives in order to pursue certain functions, particularly at the local level.

[(00:03:33)](https://youtube.com/watch?v=tYrdMjVXyNg&t=213) So a lot of governmental discussions on a pragmatic level end up being discussions about where government ought to be involved, but also at what level government ought to be involved. And I have an incredibly subsidiary view of government. I think that local governments, because you have higher levels of homogeneity and consent are capable of doing more things. And as you abstract up the chain, it becomes more and more impractical and more and more divisive to do more things.

[(00:03:59)](https://youtube.com/watch?v=tYrdMjVXyNg&t=239) In my view, government is basically there to preserve certain key liberties. Those key liberties pre-exist the government insofar as they’re more important than what priorities the government has. The job of government is to maintain, for example, national defense, protection of property rights, protection of religious freedom. These are the key focuses of government as generally expressed in the Bill of Rights and the Constitution. And I agree with the general philosophy of the Bill of Rights and the Constitution.

[(00:04:31)](https://youtube.com/watch?v=tYrdMjVXyNg&t=271) Now, that doesn’t mean by the way, that you can’t do more on a governmental level again as you get closer to the ground, which by the way is also embedded in the Constitution. People forget the Constitution was originally applied to the federal government, not to local and state government. But if I were going to define conservatism, it would actually be a little broader than that because I think to understand how people interact with government, you have to go to core values.

[(00:04:50)](https://youtube.com/watch?v=tYrdMjVXyNg&t=290) And so for me, there are a couple of premises. One, human beings have a nature. That nature is neither good nor bad. We have aspects of goodness and we have aspects of badness. Human beings are sinful. We have temptations. What that means is that we have to be careful not to incentivize the bad and that we should incentivize the good. Human beings do have agency and are capable of making decisions in the vast majority of circumstances. And it’s better for society if we act as though they do.

[(00:05:17)](https://youtube.com/watch?v=tYrdMjVXyNg&t=317) Second, the basic idea of human nature. There is an idea in my view that all human beings have equal value before the law. I’m a religious person, so I’d say equal value before God. But I think that’s also sort of a key tenet of Western civilization being non-religious or religious, that every individual has equivalent value in sort of cosmic terms.

[(00:05:36)](https://youtube.com/watch?v=tYrdMjVXyNg&t=336) But that does not necessarily mean that every person is equally equipped to do everything equally well. And so it is not the job of government to rectify every imbalance of life. The quest for cosmic justice, as Thomas Sowell suggests, is something that government is generally incapable of doing, and more often than not, botches and makes things worse. So those are a few key tenets and that tends to materialize in a variety of ways. The easiest way to sum that up would the traditional kind of three legs of the conservative stool, although now obviously there’s a very fragmented conservative movement in the United States would be a socially conservative view in which family is the chief institution of society, like the little platoons of society as Edmund Burke suggested, in which free markets and property rights are extraordinarily valuable and necessary because every individual has the ability to be creative with their property and to freely alienate that property.

[(00:06:34)](https://youtube.com/watch?v=tYrdMjVXyNg&t=394) Finally, I tend toward a hawkish foreign policy that suggests that the world is not filled with wonderful people who all agree with us and think like us. And those people will pursue adversarial interests if we do not protect our own interests.

Destiny [(00:06:46)](https://youtube.com/watch?v=tYrdMjVXyNg&t=406) Can I ask a question on that? I’m so curious.
"""

result = parse_transcript(transcript)
print(result)

{'title': 'Transcript for Ben Shapiro vs Destiny Debate: Politics, Jan 6, Israel, Ukraine & Wokeism | Lex Fridman Podcast #410 - Lex Fridman', 'url': 'https://lexfridman.com/ben-shapiro-destiny-debate-transcript/', 'content': 'Introduction\n------------\n\nDestiny [(00:00:00)](https://youtube.com/watch?v=tYrdMjVXyNg&t=0) Something has to happen with Iran. There has to be some diplomatic bilateral communication there.\n\nBen Shapiro [(00:00:04)](https://youtube.com/watch?v=tYrdMjVXyNg&t=4) No. What has to happen is the containment of Iran.\n\nDestiny [(00:00:06)](https://youtube.com/watch?v=tYrdMjVXyNg&t=6) History moves in one direction.\n\nBen Shapiro [(00:00:10)](https://youtube.com/watch?v=tYrdMjVXyNg&t=10) Communism, Nazism, all of that was a regression from what was happening at, for example, the beginning of the 19th century into the 20th century.\n\nBen Shapiro [(00:00:17)](https://youtube.com/watch?v=tYrdMjVXyNg&t=17) Do you think that today Donald Trump knows that he lost the 

In [5]:
def parse_transcript_by_subtopic(data):
    """Split parsed transcript content into one chunk per subtopic heading."""
    transcript = data["content"]
    # Regex to find subtopics: a title line followed by a dashed underline
    # (e.g., "Introduction\n------------")
    subtopic_pattern = re.compile(r"^(.*)\n-+\n", re.MULTILINE)
    # Regex to capture speaker dialogue (e.g., Destiny [(00:00:00)]...)
    # Note: \w+ keeps only the last word of a multi-word name ("Shapiro" for
    # "Ben Shapiro"), and paragraphs that begin directly with a timestamp
    # (no speaker name) are not matched.
    dialogue_pattern = re.compile(r"(?P<speaker>\w+)\s\[\((?P<timestamp>\d{2}:\d{2}:\d{2})\)\]\((?P<url>https:\/\/youtube\.com\/watch\?v=[^&]+&t=\d+)\)\s(?P<text>.+)")

    chunks = []

    # With one capture group, split() yields [preamble, title, body, title, body, ...]
    subtopics = subtopic_pattern.split(transcript)

    for i in range(1, len(subtopics), 2):
        subtopic = subtopics[i].strip()
        content_block = subtopics[i + 1] if i + 1 < len(subtopics) else ""

        # Find all dialogues within this subtopic
        dialogues = dialogue_pattern.findall(content_block)

        formatted_text = []
        speakers = []
        tstamp = None
        for speaker, timestamp, url, text in dialogues:
            # Keep the link of the first dialogue as the subtopic's timestamp
            if tstamp is None:
                tstamp = f"[({timestamp})]({url})"

            if speaker not in speakers:
                speakers.append(speaker)

            formatted_text.append(f"{speaker}: {text} \n")

        current_chunk = {
            "subtopic": subtopic,
            "content": formatted_text,
            "metadata": {
                "speakers": speakers,
                "dialogue_count": len(formatted_text),
                "title": data["title"],
                "url": data["url"],
                "timestamp": tstamp
            }
        }
        chunks.append(current_chunk)
    return chunks
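
The dialogue regex above captures the speaker with \w+, so "Ben Shapiro" is recorded as "Shapiro", and paragraphs that open with a bare timestamp link are dropped. A minimal sketch of one possible alternative follows. The names TURN_PATTERN and extract_turns are hypothetical, and attributing an unnamed paragraph to the most recent named speaker is an assumption about the transcript format, not something the source confirms.

```python
import re

# Speaker is everything on the line before the "[(HH:MM:SS)](url)" link.
# It may be empty for continuation paragraphs.
TURN_PATTERN = re.compile(
    r"^(?P<speaker>[^\[\n]*?)[ \t]*"
    r"\[\((?P<timestamp>\d{2}:\d{2}:\d{2})\)\]"
    r"\((?P<url>https://youtube\.com/watch\?v=[^&\s]+&t=\d+)\)[ \t]*"
    r"(?P<text>.+)$",
    re.MULTILINE,
)


def extract_turns(content_block):
    """Return (speaker, timestamp, url, text) tuples with full speaker names.

    A paragraph with no name before its timestamp is attributed to the most
    recent named speaker (an assumption). If the block opens with such a
    paragraph, its speaker stays None.
    """
    turns = []
    last_speaker = None
    for match in TURN_PATTERN.finditer(content_block):
        speaker = match.group("speaker").strip() or last_speaker
        last_speaker = speaker
        turns.append((
            speaker,
            match.group("timestamp"),
            match.group("url"),
            match.group("text").strip(),
        ))
    return turns
```

Swapping this in would change the recorded outputs in the cells that follow (for example, 'Shapiro: ...' would become 'Ben Shapiro: ...'), so those cells would need to be re-run.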
        

In [6]:
# test the parse_transcript_by_subtopic function
parse_transcript_by_subtopic(result)

[{'subtopic': 'Introduction',
  'content': ['Destiny: Something has to happen with Iran. There has to be some diplomatic bilateral communication there. \n',
   'Shapiro: No. What has to happen is the containment of Iran. \n',
   'Destiny: History moves in one direction. \n',
   'Shapiro: Communism, Nazism, all of that was a regression from what was happening at, for example, the beginning of the 19th century into the 20th century. \n',
   'Shapiro: Do you think that today Donald Trump knows that he lost the election? \n',
   'Destiny: This is one of the areas where we get into this, I don’t understand if there’s brain-breaking happening or what’s going on. I don’t know what world we can ever live in where we say that Trump is less divisive for the country than Biden. \n',
   'Shapiro: Joe Biden literally used the Occupational Safety and Hazard Administration to try to cram down vax mandates on 80 million Americans. That’s insane. \n',
   'Destiny: What about supercalifragilisticexpiali

In [7]:
# load transcripts

transcript_list = load_transcripts("../data")

In [9]:
# test transcript parsing 

it = parse_transcript(transcript_list[0])  # test passed for indices 0, 1, 2, 3
print(it)

{'title': 'Transcript for Ben Shapiro vs Destiny Debate: Politics, Jan 6, Israel, Ukraine & Wokeism | Lex Fridman Podcast #410 - Lex Fridman', 'url': 'https://lexfridman.com/ben-shapiro-destiny-debate-transcript/', 'content': "Introduction\n------------\n\nDestiny [(00:00:00)](https://youtube.com/watch?v=tYrdMjVXyNg&t=0) Something has to happen with Iran. There has to be some diplomatic bilateral communication there.\n\nBen Shapiro [(00:00:04)](https://youtube.com/watch?v=tYrdMjVXyNg&t=4) No. What has to happen is the containment of Iran.\n\nDestiny [(00:00:06)](https://youtube.com/watch?v=tYrdMjVXyNg&t=6) History moves in one direction.\n\nBen Shapiro [(00:00:10)](https://youtube.com/watch?v=tYrdMjVXyNg&t=10) Communism, Nazism, all of that was a regression from what was happening at, for example, the beginning of the 19th century into the 20th century.\n\nBen Shapiro [(00:00:17)](https://youtube.com/watch?v=tYrdMjVXyNg&t=17) Do you think that today Donald Trump knows that he lost the 

In [10]:
# test transcript parsing by subtopic

it = parse_transcript_by_subtopic(it) # test passed for 0, 1, 2, 3
print(it)

[{'subtopic': 'Introduction', 'content': ['Destiny: Something has to happen with Iran. There has to be some diplomatic bilateral communication there. \n', 'Shapiro: No. What has to happen is the containment of Iran. \n', 'Destiny: History moves in one direction. \n', 'Shapiro: Communism, Nazism, all of that was a regression from what was happening at, for example, the beginning of the 19th century into the 20th century. \n', 'Shapiro: Do you think that today Donald Trump knows that he lost the election? \n', 'Destiny: This is one of the areas where we get into this, I don’t understand if there’s brain-breaking happening or what’s going on. I don’t know what world we can ever live in where we say that Trump is less divisive for the country than Biden. \n', 'Shapiro: Joe Biden literally used the Occupational Safety and Hazard Administration to try to cram down vax mandates on 80 million Americans. That’s insane. \n', 'Destiny: What about supercalifragilisticexpialidocious? \n', 'Shapiro:

In [None]:
import tiktoken

In [24]:
# Initialize the tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

def get_token_count_by_subtopic(subtopics):
    token_counts = []
    for subtopic in subtopics:
        content = ' '.join(subtopic['content'])
        tokens = tokenizer.encode(content)
        token_counts.append({
            'subtopic': subtopic['subtopic'],
            'token_count': len(tokens)
        })
    return token_counts

# test usage
token_counts = get_token_count_by_subtopic(it)
print(token_counts)

[{'subtopic': 'Introduction', 'token_count': 1356}, {'subtopic': 'Bilingualism and thinking', 'token_count': 497}, {'subtopic': 'Video prediction', 'token_count': 761}, {'subtopic': 'JEPA (Joint-Embedding Predictive Architecture)', 'token_count': 169}, {'subtopic': 'JEPA vs LLMs', 'token_count': 1235}, {'subtopic': 'DINO and I-JEPA', 'token_count': 188}, {'subtopic': 'V-JEPA', 'token_count': 477}, {'subtopic': 'Hierarchical planning', 'token_count': 1058}, {'subtopic': 'Autoregressive LLMs', 'token_count': 2167}, {'subtopic': 'AI hallucination', 'token_count': 914}, {'subtopic': 'Reasoning in AI', 'token_count': 1898}, {'subtopic': 'Reinforcement learning', 'token_count': 670}, {'subtopic': 'Woke AI', 'token_count': 486}, {'subtopic': 'Open source', 'token_count': 641}, {'subtopic': 'AI and ideology', 'token_count': 457}, {'subtopic': 'Marc Andreesen', 'token_count': 1249}, {'subtopic': 'Llama 3', 'token_count': 797}, {'subtopic': 'AGI', 'token_count': 609}, {'subtopic': 'AI doomers', 

In [25]:
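def chunk_text(text, max_tokens=500, min_tokens=300):
    """Split text into chunks of max_tokens tokens each.

    A final remainder shorter than min_tokens is merged into the previous
    chunk, so the last chunk can hold up to max_tokens + min_tokens - 1 tokens.
    Cuts fall on token boundaries, so a chunk may start or end mid-sentence.
    """
    # Tokenize the input text
    tokens = tokenizer.encode(text)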
    chunks = []
    current_chunk = []
    for token in tokens:
        current_chunk.append(token)
        # Close the current chunk once it reaches the max token limit
        if len(current_chunk) >= max_tokens:
            chunks.append(current_chunk)
            current_chunk = []
    # Handle the last chunk, ensure it meets the minimum size requirement
    if current_chunk:
        if len(current_chunk) < min_tokens and chunks:
            # If the last chunk is smaller than the minimum, merge it with the previous chunk
            chunks[-1].extend(current_chunk)
        else:
            chunks.append(current_chunk)
    return [tokenizer.decode(chunk) for chunk in chunks]
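
chunk_text cuts at raw token boundaries, which can split a speaker's turn (or a word) across two chunks. A minimal sketch of an alternative that groups whole dialogue turns instead. The function name chunk_turns and the count_tokens hook are hypothetical. The default counter is a whitespace word count so the sketch runs without a tokenizer, which is an assumption and not equivalent to tiktoken counts.

```python
def chunk_turns(turns, max_tokens=500, count_tokens=None):
    """Group whole dialogue turns into chunks of at most max_tokens each.

    count_tokens is a hypothetical hook: any callable mapping a string to an
    integer size. It defaults to a whitespace word count. To match the
    notebook's tokenizer, pass lambda s: len(tokenizer.encode(s)).
    A single turn larger than max_tokens becomes its own oversized chunk.
    """
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())

    chunks, current, current_size = [], [], 0
    for turn in turns:
        size = count_tokens(turn)
        # Start a new chunk if adding this turn would exceed the limit
        if current and current_size + size > max_tokens:
            chunks.append(current)
            current, current_size = [], 0
        current.append(turn)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

For example, chunk_turns(formatted_text, max_tokens=500, count_tokens=lambda s: len(tokenizer.encode(s))) would return lists of turns that can be joined into chunk strings. Summing per-turn counts only approximates the token count of the joined text.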

In [26]:
def parse_and_chunk_transcript_by_subtopic(data, max_tokens=500):
    """Parse content by subtopic, splitting subtopics longer than max_tokens."""
    transcript = data["content"]
    # Regex to find subtopics (e.g., Introduction, Education)
    subtopic_pattern = re.compile(r"^(.*)\n-+\n", re.MULTILINE)
    # Regex to capture speaker dialogue (e.g., Destiny [(00:00:00)]...)
    dialogue_pattern = re.compile(r"(?P<speaker>\w+)\s\[\((?P<timestamp>\d{2}:\d{2}:\d{2})\)\]\((?P<url>https:\/\/youtube\.com\/watch\?v=[^&]+&t=\d+)\)\s(?P<text>.+)")

    chunks = []

    subtopics = subtopic_pattern.split(transcript)

    for i in range(1, len(subtopics), 2):
        subtopic = subtopics[i].strip()
        content_block = subtopics[i + 1] if i + 1 < len(subtopics) else ""

        # Find all dialogues within this subtopic
        dialogues = dialogue_pattern.findall(content_block)

        formatted_text = []
        speakers = []
        tstamp = None
        for speaker, timestamp, url, text in dialogues:
            # Keep the link of the first dialogue as the subtopic's timestamp
            if tstamp is None:
                tstamp = f"[({timestamp})]({url})"

            if speaker not in speakers:
                speakers.append(speaker)

            formatted_text.append(f"{speaker}: {text} \n")

        # token count
        joined_text = ' '.join(formatted_text)
        tok_count = len(tokenizer.encode(joined_text))

        if tok_count > max_tokens:
            token_chunks = chunk_text(joined_text, max_tokens=max_tokens)
            for chunk in token_chunks:
                current_chunk = {
                    "subtopic": subtopic,
                    "content": chunk,
                    "metadata": {
                        "speakers": speakers,
                        # Dialogue turns (whole or partial) in this chunk;
                        # len(chunk) would count characters, not turns
                        "dialogue_count": sum(1 for line in chunk.split("\n") if line.strip()),
                        "title": data["title"],
                        "url": data["url"],
                        "timestamp": tstamp
                    }
                }
                chunks.append(current_chunk)
        else:
            current_chunk = {
                "subtopic": subtopic,
                # Joined into one string so "content" has the same type in both branches
                "content": joined_text,
                "metadata": {
                    "speakers": speakers,
                    "dialogue_count": len(formatted_text),
                    "title": data["title"],
                    "url": data["url"],
                    "timestamp": tstamp
                }
            }
            chunks.append(current_chunk)
    return chunks

In [27]:
# test final func

it = parse_transcript(transcript_list[2]) 
it = parse_and_chunk_transcript_by_subtopic(it)
print(it)

[{'subtopic': 'Introduction', 'content': 'Altman: I think compute is going to be the currency of the future. I think it’ll be maybe the most precious commodity in the world. I expect that by the end of this decade, and possibly somewhat sooner than that, we will have quite capable systems that we look at and say, “Wow, that’s really remarkable.” The road to AGI should be a giant power struggle. I expect that to be the case. \n Fridman: Whoever builds AGI first gets a lot of power. Do you trust yourself with that much power? \n Altman: That was definitely the most painful professional experience of my life, and chaotic and shameful and upsetting and a bunch of other negative things. There were great things about it too, and I wish it had not been in such an adrenaline rush that I wasn’t able to stop and appreciate them at the time. But I came across this old tweet of mine or this tweet of mine from that time period. It was like going your own eulogy, watching people say all these great 