# VidXiv

### 1. ingest with pymupdf

In [2]:
import pymupdf4llm
import re

md_text = pymupdf4llm.to_markdown("data/attention_is_all_you_need.pdf")

In [3]:
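# find all occurrences of "references" (case-insensitive) and keep the last one
references_indices = [m.start() for m in re.finditer(r'references', md_text, flags=re.IGNORECASE)]
# fall back to the full text if there is no match; slicing with -1 would drop the last character
last_ref_index = references_indices[-1] if references_indices else len(md_text)

# keep everything before the final "references" match
parsed_text = md_text[:last_ref_index]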

### 2. prep and segment
- split into sections
- summarize each section via LLM to pull out key points
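
A minimal sketch of the "split into sections" step, assuming the converted markdown marks section titles with ATX headings (`#` to `######`). The helper name `split_sections` and the `preamble` key are hypothetical, not part of the notebook. Titles that come out only as bold text (as `**Abstract**` does in the output below) would not be picked up by this pattern.

```python
import re

# Matches ATX headings such as "## **Title**" (assumed heading style).
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.*\S)[ \t]*$", re.MULTILINE)


def split_sections(md: str) -> list[tuple[str, str]]:
    """Return (title, body) pairs; text before the first heading is titled 'preamble'."""
    sections = []
    matches = list(HEADING_RE.finditer(md))

    # Text before the first heading (or the whole text if there are no headings).
    head_end = matches[0].start() if matches else len(md)
    preamble = md[:head_end].strip()
    if preamble:
        sections.append(("preamble", preamble))

    for i, m in enumerate(matches):
        # A section body runs until the next heading or the end of the text.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(md)
        title = m.group(2).strip("*_ ")  # drop bold/italic markers around the title
        sections.append((title, md[m.end():body_end].strip()))
    return sections
```

Calling `split_sections(parsed_text)` would give a list that can be fed section by section to the summarization prompt.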

In [4]:
print(parsed_text.replace("\n\n", "\n").replace("\n\n", "\n"))

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
## **Attention Is All You Need**
**Niki Parmar** *[∗]*
Google Research
```
nikip@google.com
```
**Ashish Vaswani** *[∗]*
Google Brain
```
avaswani@google.com
```
**Llion Jones** *[∗]*
Google Research
```
 llion@google.com
```
**Noam Shazeer** *[∗]*
Google Brain
```
noam@google.com
```
**Jakob Uszkoreit** *[∗]*
Google Research
```
usz@google.com
```
**Aidan N. Gomez** *[∗†]*
University of Toronto
```
aidan@cs.toronto.edu
```
**Łukasz Kaiser** *[∗]*
Google Brain
```
lukaszkaiser@google.com
```
**Illia Polosukhin** *[∗‡]*
```
             illia.polosukhin@gmail.com
```
**Abstract**
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism.

### 3. story board
for each section:
- narration draft (bullet points of what to say)
- visual plan (diagrams/animations needed to illustrate each bullet)

draft narration first -> then design visuals around it to keep the story tight

In [5]:
from dotenv import load_dotenv
import os

prompt = ""
with open("prompts/storyboard.txt", "r") as f:
    prompt = f.read()


prompt = prompt.replace("{{full_paper_text}}", parsed_text.replace("\n\n", "\n").replace("\n\n", "\n"))


In [None]:
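from google import genai
from google.genai import types
import os

# load environment variables from .env.local
load_dotenv('.env.local')

api_key = os.getenv('GEMINI_API_KEY')

client = genai.Client(api_key=api_key)

# sampling parameters such as temperature go in the config object, not as top-level keyword arguments
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=prompt,
    config=types.GenerateContentConfig(temperature=0.6),
)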

print(response.text)

```xml
<storyboard>
  <section>
    <title>hook</title>
    <goal>Grab the audience's attention with a surprising claim about a fundamental shift in neural network architecture.</goal>
    <narration>
      <bullet>What if you could build powerful sequence models...</bullet>
      <bullet>...without using recurrent or convolutional networks at all?</bullet>
      <bullet>This paper made a bold claim: Attention Is All You Need.</bullet>
    </narration>
    <visuals>
      <bullet>Text overlay: "RNNs?" -> Red Cross. "CNNs?" -> Red Cross.</bullet>
      <bullet>Text overlay: "Attention?" -> Green Check.</bullet>
      <bullet>Paper title appears dramatically: "Attention Is All You Need".</bullet>
    </visuals>
  </section>
  <section>
    <title>intro</title>
    <goal>Explain the limitations of dominant prior architectures and introduce the problem the paper solves.</goal>
    <narration>
      <bullet>RNNs process data one step at a time.</bullet>
      <bullet>This sequential nature 

In [8]:
import re
import json

# Regular expression to match each section block
section_pattern = re.compile(
    r"<section>\s*<title>(.*?)</title>\s*<goal>.*?</goal>\s*<narration>(.*?)</narration>\s*<visuals>(.*?)</visuals>\s*</section>",
    re.DOTALL
)

# Regular expression to match <bullet> elements inside narration or visuals
bullet_pattern = re.compile(r"<bullet>(.*?)</bullet>")

# Parse each section
storyboard_data = {}
for match in section_pattern.finditer(response.text):
    title, narration_block, visuals_block = match.groups()
    
    narration_bullets = bullet_pattern.findall(narration_block)
    visuals_bullets = bullet_pattern.findall(visuals_block)
    
    storyboard_data[title.strip()] = {
        "narration": narration_bullets,
        "visuals": visuals_bullets
    }

# Print as formatted JSON
print(json.dumps(storyboard_data, indent=2))

# Print as XML
print("\nXML Format:")
for title, data in storyboard_data.items():
    print("<section>")
    print(f"  <title>{title}</title>")
    print("  <narration>")
    for bullet in data["narration"]:
        print(f"    <bullet>{bullet}</bullet>")
    print("  </narration>")
    print("  <visuals>")
    for bullet in data["visuals"]:
        print(f"    <bullet>{bullet}</bullet>")
    print("  </visuals>")
    print("</section>")

{
  "hook": {
    "narration": [
      "What if you could build powerful sequence models...",
      "...without using recurrent or convolutional networks at all?",
      "This paper made a bold claim: Attention Is All You Need."
    ],
    "visuals": [
      "Text overlay: \"RNNs?\" -> Red Cross. \"CNNs?\" -> Red Cross.",
      "Text overlay: \"Attention?\" -> Green Check.",
      "Paper title appears dramatically: \"Attention Is All You Need\"."
    ]
  },
  "intro": {
    "narration": [
      "RNNs process data one step at a time.",
      "This sequential nature hurts parallelization and speed.",
      "CNNs capture local patterns but need many layers for long-range links.",
      "Learning distant dependencies was a core challenge."
    ],
    "visuals": [
      "Simple diagram: Input sequence (blocks A, B, C, D) -> RNN node (processing A) -> node (processing B + A state) -> ... (sequential chain).",
      "Highlight the arrow chain in the RNN diagram to show sequential dependency."

### 4. asset creation:
- voiceover:
    - finalize script, generate audio (ElevenLabs)
- visuals:
    - translate each visual plan into Manim CE scenes, passing the final script as context
    - render to short mp4 clips
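
One way the "finalize script" step could start is by joining each section's narration bullets into a single block of voiceover text. This is only a sketch: it assumes the `storyboard_data` shape produced in step 3 (`{title: {"narration": [...], "visuals": [...]}}`), and `build_voiceover_script` is a hypothetical helper. The text-to-speech call itself is left out.

```python
def build_voiceover_script(storyboard: dict) -> dict:
    """Join each section's narration bullets into one string of voiceover text."""
    scripts = {}
    for title, data in storyboard.items():
        # Skip empty bullets and trim stray whitespace left over from parsing.
        bullets = [b.strip() for b in data.get("narration", []) if b.strip()]
        scripts[title] = " ".join(bullets)
    return scripts
```

`build_voiceover_script(storyboard_data)["hook"]` would then hold the full text to send to the audio generator for the hook section.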


In [None]:
# create visualizations with Gemini

### 5. assembly
- combine clips with voiceover (moviepy or ffmpeg)
- add titles, fades, transitions if you want
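
If ffmpeg is the tool chosen, muxing one rendered clip with its voiceover could look roughly like the sketch below. The function names and file paths are hypothetical; the sketch assumes `ffmpeg` is on the `PATH` and that the video stream can be copied without re-encoding.

```python
import subprocess


def build_mux_command(clip: str, voiceover: str, output: str) -> list[str]:
    """Build an ffmpeg command that pairs a clip's video with a voiceover track."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", clip,            # input 0: rendered video clip
        "-i", voiceover,       # input 1: narration audio
        "-map", "0:v:0",       # take the video stream from the clip
        "-map", "1:a:0",       # take the audio stream from the voiceover
        "-c:v", "copy",        # copy video without re-encoding
        "-c:a", "aac",         # encode audio as AAC for the mp4 container
        "-shortest",           # stop at the end of the shorter stream
        output,
    ]


def mux(clip: str, voiceover: str, output: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mux_command(clip, voiceover, output), check=True)
```

For example, `mux("clips/hook.mp4", "audio/hook.mp3", "out/hook.mp4")` with placeholder paths. Fades and transitions would need filters and a re-encode, which this sketch does not cover.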

### 6. export mp4 + QA
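
A possible first QA check is comparing the duration of the final video against the voiceover. The sketch below assumes `ffprobe` (shipped with ffmpeg) is installed; the helper names and the 0.5 second tolerance are illustrative choices, not part of the pipeline above.

```python
import subprocess


def probe_duration_command(path: str) -> list[str]:
    """Build an ffprobe command that prints only the container duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]


def media_duration(path: str) -> float:
    """Run ffprobe and parse its output as a float number of seconds."""
    result = subprocess.run(
        probe_duration_command(path), capture_output=True, text=True, check=True
    )
    return float(result.stdout.strip())


def durations_match(video_s: float, audio_s: float, tolerance_s: float = 0.5) -> bool:
    """True when the two durations differ by no more than the tolerance."""
    return abs(video_s - audio_s) <= tolerance_s
```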