# Data Generation SDK: Example Usage

We are going to generate a dataset of squat video clips with instructional annotations that can be used to train an LLM.

In [4]:
import datagen
from datagen.core.config import DatagenConfig
config = DatagenConfig.from_yaml('./config.yaml')

## Get a list of search queries to use

In [3]:
from datagen.search import get_queries, get_video_info
queries = get_queries(config=config, prompt='I want to find instructional videos about how to do squats.', num_queries=5)
print(len(queries))
queries

5


['how to do squats',
 'squat exercises for beginners',
 'proper squat form',
 'squat variations',
 'how to squat correctly']

## Download video information for each query.

Each result carries a lot of metadata that you can use to filter the videos at this stage if necessary, but later steps only use the video ids.<br>
Videos are deduplicated across queries, so the same video is never downloaded more than once.

In [None]:
df = get_video_info(queries, videos_per_query=10)

In [5]:
df.head()

Unnamed: 0,id,id.1,title,formats,thumbnails,thumbnail,description,channel_id,channel_url,duration,...,vbr,stretched_ratio,aspect_ratio,acodec,abr,asr,audio_channels,query,location,queries
0,4KmY44Xsg2w,4KmY44Xsg2w,The Basic Squat - Balance Exercise - CORE Chir...,"[{'format_id': 'sb2', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/4KmY44Xsg2w/3...,https://i.ytimg.com/vi_webp/4KmY44Xsg2w/maxres...,Dr. Natalie Cordova demonstrates how to do a b...,UCW6EenBHb_KF-eaFRE3gXnA,https://www.youtube.com/channel/UCW6EenBHb_KF-...,173,...,707.089,,1.78,opus,93.518,48000,2,squat exercises for beginners,CORE CHIROPRACTIC,[['squat exercises for beginners']]
1,AIZ8q1qruKw,AIZ8q1qruKw,How to Perform a PERFECT Squat,"[{'format_id': 'sb1', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/AIZ8q1qruKw/o...,https://i.ytimg.com/vi/AIZ8q1qruKw/sd2.jpg?sqp...,Get my book on fixing injury here: \nhttps://w...,UCyPYQTT20IgzVw92LDvtClw,https://www.youtube.com/channel/UCyPYQTT20IgzV...,59,...,649.69,,0.56,opus,108.711,48000,2,how to do squats,,[['how to do squats']]
2,C73Y3EsJWIk,C73Y3EsJWIk,Top 10 BEST SQUATS Variations,"[{'format_id': 'sb2', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/C73Y3EsJWIk/3...,https://i.ytimg.com/vi_webp/C73Y3EsJWIk/maxres...,Top 10 Best Squat Exercises:\n\nHigh Bar Squat...,UCKf0UqBiCQI4Ol0To9V0pKQ,https://www.youtube.com/channel/UCKf0UqBiCQI4O...,479,...,1118.83,,1.78,mp4a.40.2,129.483,44100,2,squat variations,,[['squat variations']]
3,EbOPpWi4L8s,EbOPpWi4L8s,How to Do Squats for Beginners,"[{'format_id': 'sb3', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/EbOPpWi4L8s/3...,https://i.ytimg.com/vi/EbOPpWi4L8s/maxresdefau...,How to Do Squats for Beginners. Part of the se...,UCE8wCVw_ZfRw-D6RJ5EXWbw,https://www.youtube.com/channel/UCE8wCVw_ZfRw-...,97,...,415.467,,1.78,opus,106.487,48000,2,squat exercises for beginners,,[['squat exercises for beginners']]
4,EzvnMZuxGWw,EzvnMZuxGWw,Perfect Squat Form in 3 Steps!,"[{'format_id': 'sb2', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/EzvnMZuxGWw/3...,https://i.ytimg.com/vi/EzvnMZuxGWw/maxresdefau...,,UCyPYQTT20IgzVw92LDvtClw,https://www.youtube.com/channel/UCyPYQTT20IgzV...,60,...,1299.342,,0.56,opus,127.322,48000,2,proper squat form,,"[['proper squat form'], ['squat variations']]"


In [13]:
# 5 queries x 10 videos per query = 50 results, of which 28 are unique videos
len(df)

28
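
The gap between 50 results and 28 rows comes from deduplicating videos that were returned by more than one query. The sketch below illustrates the idea on plain Python data. The `results` list and the `dedupe_by_id` helper are hypothetical and not part of the SDK, whose internal implementation may differ.

```python
def dedupe_by_id(results):
    """Keep one entry per video id and collect every query that returned it."""
    unique = {}
    for video_id, query in results:
        unique.setdefault(video_id, []).append(query)
    return unique


# Hypothetical (video id, query) pairs; the ids come from the table above.
results = [
    ("EzvnMZuxGWw", "proper squat form"),
    ("AIZ8q1qruKw", "how to do squats"),
    ("EzvnMZuxGWw", "squat variations"),
]

unique = dedupe_by_id(results)
print(len(results), "results ->", len(unique), "unique videos")
print(unique["EzvnMZuxGWw"])
```

This mirrors the `queries` column in the output above, where a video found by two queries keeps both of them.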

In [10]:
df.to_csv(config.data_dir / 'video_info.csv')

## Download videos and autogenerated subtitles

You can change subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (see https://github.com/yt-dlp/yt-dlp).<br>
The SDK currently expects `.mp4` video files, so don't change the video format.
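
As an illustration, a `yt_dlp_opts` dictionary could look like the sketch below. The keys are standard yt-dlp `YoutubeDL` options and the values are only examples. How the SDK receives this dictionary (for instance through `config.yaml`) is not shown in this notebook, so treat the wiring as an assumption and check the SDK source.

```python
# Illustrative yt-dlp options; the values are assumptions, not the SDK defaults.
yt_dlp_opts = {
    # Prefer mp4 video, since the SDK expects .mp4 files.
    "format": "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]",
    # Download autogenerated subtitles.
    "writeautomaticsub": True,
    # Subtitle languages to fetch.
    "subtitleslangs": ["en"],
    # Subtitle file format.
    "subtitlesformat": "vtt",
}
```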

In [None]:
from datagen.download_videos import download_videos
download_videos(df['id'], config)

## Detect segments in each video and analyze them with GPT-4o

In [5]:
from datagen.detect_segments import detect_segments

from typing import Optional
from langchain.pydantic_v1 import BaseModel, Field

# This is the schema extracted from each detected segment; it is also used during annotation.
# If you want the annotations to focus on the transcript, do not extract too much visual information here, as it might distract the LLM during annotation.
# We will use `doing_squats` for filtering
# and `overlay_text` for annotation, although the latter might add noise.

class SegmentInfo(BaseModel):
    '''Information about a segment'''
    doing_squats: bool = Field(description='Whether the person is doing squats. Only consider video of people, not renders or cartoons. If a person looks like they are preparing to do squats or standing between reps, consider them also doing squats if they are in a gym setting, wearing sportswear etc.')
    overlay_text: str = Field(description='Overlay text that is superimposed over the image, if present.')
    # clothes: str = Field(description='Clothes of the person doing squats in detail.')
    # image_description: str = Field(description="Describe the image in detail")


segments = []
for video in config.get_videos():
    print(video.stem)
    segments.append(detect_segments(
        video_id=video.stem,
        segment_info_schema=SegmentInfo,
        # None selects the default AdaptiveDetector algorithm, which is good for most types of video.
        # https://www.scenedetect.com/docs/latest/api/detectors.html
        detection_algorithm=None,
        # Discard segments that are too short or too long to save some GPT calls.
        min_duration=1, max_duration=60,
        # Number of frames per segment used for detection. More frames are more accurate and capture
        # more information (e.g. changing overlay text), but are also slower and more expensive.
        frames_per_segment=1,
        config=config))


byxWus7BwfQ
OTyb4YUDYYY
w8ZhgecdIAM
T6id8FuUcao
a3aw-5vDM2E
AIZ8q1qruKw
C73Y3EsJWIk
EbOPpWi4L8s
SLOkdLLWj8A
EzvnMZuxGWw
PJj5shV4uYo
MM9ObaAPcv4
HgDZlNQrifY
LFkinX12jtU
LF4zb2SYWjQ
xawAf5fXD2c
MLoZuAkIyZI
TH6jSCGnowI
HZilSL4ZNvQ
gslEzVggur8
cRxg-PUAT6I
IB_icWRzi4E
my0tLDaWyDU
4KmY44Xsg2w
rhbIFJj4UYc
jhb_nnV29EU
iZTxa8NJH2g
PPmvh7gBTi0


In [7]:
# 28 videos took 14.5 minutes with 1 frame per segment,
# i.e. about 30 seconds per video
!ls tmp/squats/segments | wc -l

      28
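
The per-video figure in the comment above follows from simple arithmetic, shown here as a quick check. The extrapolation to a larger run assumes that processing time scales linearly with the number of videos, which is only a rough approximation because it also depends on video length and on how many segments are detected.

```python
total_minutes = 14.5
num_videos = 28

seconds_per_video = total_minutes * 60 / num_videos
print(f"{seconds_per_video:.1f} s per video")  # 31.1 s per video

# Rough linear extrapolation (an assumption) to a hypothetical 100-video run.
estimated_minutes = seconds_per_video * 100 / 60
print(f"{estimated_minutes:.1f} min for 100 videos")  # 51.8 min for 100 videos
```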


## Annotate the segments from the transcript and additional info

In [8]:
segments = config.get_segments(info_type=SegmentInfo)
print(len(segments))
# we're only interested in segments where people do squats
segments = [x for x in segments if x.segment_info['doing_squats']]
print(len(segments))
segments[0]

357
227


Segment(start_timestamp='00:00:00.000', end_timestamp='00:00:02.800', fps=30.0, segment_info={'doing_squats': True, 'overlay_text': 'TOP THREE ONE LEGGED SQUATS'}, video_id='a3aw-5vDM2E')
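
The `start_timestamp` and `end_timestamp` fields are strings in `HH:MM:SS.mmm` form. Below is a minimal sketch for turning them into a duration in seconds, for example to re-check the `min_duration`/`max_duration` bounds used earlier. The helper names are hypothetical and not part of the SDK, and treating the bounds as inclusive is an assumption.

```python
def timestamp_to_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.mmm' string to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def segment_duration(start: str, end: str) -> float:
    """Duration in seconds between two timestamps of the same video."""
    return timestamp_to_seconds(end) - timestamp_to_seconds(start)


# Timestamps of the first segment printed above.
duration = segment_duration("00:00:00.000", "00:00:02.800")
within_limits = 1 <= duration <= 60  # assumed inclusive bounds
print(duration, within_limits)
```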

In [9]:
from datagen.annotate import generate_annotations
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This schema will be extracted for each segment.
# This is the most important part of the annotation, and getting good results requires a lot of experimenting.

class QA(BaseModel):
    '''
    Question and answer about a video segment.
    Only write questions and answers about the correctness of the exercises or in which ways the performance in the video was wrong.
    '''
    question: str = Field(description='Question about the exercise performance in the video')
    answer: str = Field(description='Answer about the exercise performance from a trainer.')
    quote: str = Field(description='A direct and explicit quote from transcript or on-screen-text. The answer must be directly inferred from this quote.')

    # Valid QA:
    # - Was the exercise performed correctly? -No.
    # - How can I improve my form? -Set your knees wider.
    # Invalid QA:
    # - Was the exercise performed correctly? -There is no information about that.
    # - What is written on screen? - The word Hello is written on screen.

class SegmentAnnotation(BaseModel):
    '''
    If there is information about whether the exercise was performed correctly or not, make a QA about it.
    If the exercise was performed incorrectly, make one or more QA about in which ways it was performed incorrectly.
    If it's not possible to infer whether the exercise was performed correctly, do not output a segment annotation.
    Do not output any other kinds of questions and answers.
    If no possible such QA could be generated from the explicit information in the transcript or on-screen text, do not output annotation for this segment.
    Output at most one annotation per segment.
    '''
    correct: Optional[bool] = Field(description='Whether the exercise was performed correctly. If there is no information about that, do not output this field.')
    incorrect_reasons: Optional[str] = Field(description='If the exercise was performed incorrectly, the reasons that were given for why the performance was incorrect. If there is no information about that, do not output this field.')
    qa: list[QA]

annotations = generate_annotations(segments=segments, config=config, annotation_schema=SegmentAnnotation)


In [34]:
from datagen.annotate import aggregate_annotations

# here we filter the segments to only leave those that have correctness information and some QA
filter_func = lambda seg: seg['correct'] is not None and len(seg['qa'])
annotations = aggregate_annotations(config, filter_func=filter_func)
print('Total segments:', len(annotations))
annotations[0]

skipping AIZ8q1qruKw
skipping iZTxa8NJH2g
Total segments: 25


{'start_timestamp': '00:00:26.933',
 'end_timestamp': '00:00:28.467',
 'segment_annotation': {'correct': False,
  'incorrect_reasons': 'Incorrect stance',
  'qa': [{'question': 'Was the exercise performed correctly?',
    'answer': 'No',
    'quote': 'X, ✓'}]},
 'video_id': 'OTyb4YUDYYY',
 'id': 'OTyb4YUDYYY_0',
 'video_path': 'OTyb4YUDYYY_0.mp4'}
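
To see what `filter_func` keeps and drops, here is a small sketch that applies a slightly more defensive variant of the lambda above to hand-written dictionaries. It assumes, as the lambda suggests, that the function receives the annotation fields (`correct`, `qa`) as a dictionary. The sample inputs are hypothetical.

```python
def has_correctness_and_qa(annotation: dict) -> bool:
    """Keep annotations that state correctness and contain at least one QA."""
    return annotation.get("correct") is not None and len(annotation.get("qa", [])) > 0


# Hypothetical annotations covering the three interesting cases.
samples = [
    {"correct": False, "qa": [{"question": "Was the exercise performed correctly?", "answer": "No"}]},
    {"correct": None, "qa": [{"question": "How can I improve my form?", "answer": "Set your knees wider."}]},
    {"correct": True, "qa": []},
]

kept = [sample for sample in samples if has_correctness_and_qa(sample)]
print(len(kept), "of", len(samples), "kept")
```

Only the first sample passes: the second has no correctness information and the third has no QA pairs.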

In [35]:
import json
with open(config.data_dir / 'annotations.json', 'w') as f:
    json.dump(annotations, f)

## The last step is to cut video clips for these segments from the original videos

In [32]:
from datagen.cut_videos import cut_videos
cut_videos(annotations, config=config)

100%|██████████| 25/25 [00:22<00:00,  1.13it/s]
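
This notebook does not show how `cut_videos` works internally. For intuition only, one common way to cut a clip by timestamps is an `ffmpeg` call with `-ss` and `-to`. The sketch below just builds such a command without running it. The function name and the file paths are hypothetical, and the SDK may do this differently.

```python
from pathlib import Path


def build_ffmpeg_cut_command(source: Path, start: str, end: str, target: Path) -> list[str]:
    """Build an ffmpeg command that re-encodes the [start, end] range into target."""
    return [
        "ffmpeg", "-y",
        "-i", str(source),
        "-ss", start,   # clip start, e.g. '00:00:26.933'
        "-to", end,     # clip end, e.g. '00:00:28.467'
        "-c:v", "libx264", "-c:a", "aac",
        str(target),
    ]


# Hypothetical paths; timestamps are taken from the annotation shown earlier.
command = build_ffmpeg_cut_command(
    Path("videos/OTyb4YUDYYY.mp4"), "00:00:26.933", "00:00:28.467",
    Path("clips/OTyb4YUDYYY_0.mp4"),
)
print(" ".join(command))
```

Re-encoding instead of stream copying is slower, but it usually yields more accurate cut points for short clips like these.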


As a result, we generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items, each with the fields:
    - `video_id`: 11-character YouTube video id (`youtube.com/watch?v=<id>`)
    - `start_timestamp`/`end_timestamp`: boundaries of the clip relative to the YouTube video it was taken from
    - `video_path`: path of the clip relative to `<data_dir>/clips/`
    - `segment_annotation`: the annotation that you can use for training
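
As a possible next step, the items in `annotations.json` can be flattened into one record per question/answer pair. The sketch below works on an in-memory item copied from the output shown earlier. The `to_training_records` helper and the record layout are assumptions, so adapt them to whatever your training framework expects. In practice you would load the list with `json.load` from `<data_dir>/annotations.json`.

```python
def to_training_records(annotations):
    """Flatten annotation items into one record per question/answer pair."""
    records = []
    for item in annotations:
        for qa in item["segment_annotation"]["qa"]:
            records.append({
                "video_path": item["video_path"],
                "question": qa["question"],
                "answer": qa["answer"],
            })
    return records


# One item in the format produced above.
annotations = [{
    "start_timestamp": "00:00:26.933",
    "end_timestamp": "00:00:28.467",
    "segment_annotation": {
        "correct": False,
        "incorrect_reasons": "Incorrect stance",
        "qa": [{"question": "Was the exercise performed correctly?", "answer": "No"}],
    },
    "video_id": "OTyb4YUDYYY",
    "id": "OTyb4YUDYYY_0",
    "video_path": "OTyb4YUDYYY_0.mp4",
}]

records = to_training_records(annotations)
print(records)
```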