# Getting Started with Data Generation SDK

We are going to generate a dataset of squat videos with instructions on how to perform them, so that we can train an AI personal trainer.

In [1]:
from datagen import DatagenConfig
# This config handles all the bookkeeping, so you need to pass it everywhere.
config = DatagenConfig.from_yaml('./config.yaml')

## Generate search queries for finding videos

In [2]:
from datagen import get_queries
queries = get_queries(
    config=config,
    prompt='I want to find instructional videos about how to do squats.',
    num_queries=2
)
queries

['how to do squats instructional video', 'squats exercise tutorial']

## Download video information for each query.

We'll get 2 videos for each query.<br>
One video might be found by multiple queries, so we might get fewer than `num_queries * videos_per_query` videos.<br>
If you want to get all YouTube videos for a query, don't pass the `videos_per_query` parameter.

You can limit the search to videos licensed under Creative Commons (as indicated by YouTube).<br>
Search libraries don't implement this filter directly yet, so we search for all videos and filter by license afterwards.<br>
Unfortunately, this usually yields very few results, so use it with caution.
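
The search-then-filter approach can be sketched roughly as follows. This is only an illustration: the metadata dicts, the `license` key, and the helper `is_creative_commons` are assumptions, not the SDK's actual internals.

```python
# Hypothetical search results: one metadata dict per video.
# The 'license' key and its values are assumptions for illustration.
search_results = [
    {'id': 'aaaaaaaaaaa', 'license': 'Creative Commons Attribution license (reuse allowed)'},
    {'id': 'bbbbbbbbbbb', 'license': None},
    {'id': 'ccccccccccc', 'license': 'Standard YouTube License'},
]

def is_creative_commons(video: dict) -> bool:
    """Return True if the video's license string mentions Creative Commons."""
    license_name = video.get('license') or ''
    return 'creative commons' in license_name.lower()

# Filter after searching: only videos with a matching license are kept.
cc_ids = [v['id'] for v in search_results if is_creative_commons(v)]
print(cc_ids)
```

Because the filter runs after the search, a small `videos_per_query` can easily leave no Creative Commons videos at all.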

In [3]:
from datagen import get_video_ids
ids = get_video_ids(queries, config=config, videos_per_query=2, only_creative_commons=False)
ids

100%|██████████| 2/2 [00:02<00:00,  1.06s/it]


['YaXPRqUwItQ', 'xqvCmoLULNY', 'gcNh17Ckjgg']
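
Getting three IDs instead of four is consistent with one video being returned by both queries. A minimal sketch of order-preserving deduplication, assuming a hypothetical split of the IDs between the two queries (the SDK's real per-query results are not shown here):

```python
from itertools import chain

# Hypothetical per-query results; which ID came from which query is assumed.
results_per_query = {
    'how to do squats instructional video': ['YaXPRqUwItQ', 'xqvCmoLULNY'],
    'squats exercise tutorial': ['xqvCmoLULNY', 'gcNh17Ckjgg'],
}

# dict.fromkeys keeps the first occurrence of each ID and preserves order.
unique_ids = list(dict.fromkeys(chain.from_iterable(results_per_query.values())))
print(unique_ids)
```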

## Download videos and autogenerated subtitles

You can change subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (refer to https://github.com/yt-dlp/yt-dlp).<br>
The SDK expects `.mp4` video files (for now), so don't change the video format.
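
As a rough illustration, such a dictionary might look like the one below. The keys are standard yt-dlp `YoutubeDL` options; where exactly the SDK accepts `yt_dlp_opts` and how it merges it with its own defaults is an assumption, so check the SDK before relying on this.

```python
# Option names are standard yt-dlp YoutubeDL parameters.
# How the SDK combines this dict with its own defaults is an assumption.
yt_dlp_opts = {
    'writeautomaticsub': True,       # download auto-generated subtitles
    'subtitleslangs': ['en', 'de'],  # subtitle languages to fetch
    'subtitlesformat': 'vtt',        # subtitle file format
}
print(yt_dlp_opts)
```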

In [4]:
from datagen import download_videos
download_videos(ids, config)

[youtube] Extracting URL: https://www.youtube.com/watch?v=YaXPRqUwItQ
[youtube] YaXPRqUwItQ: Downloading webpage
[youtube] YaXPRqUwItQ: Downloading ios player API JSON
[youtube] YaXPRqUwItQ: Downloading android player API JSON
[youtube] YaXPRqUwItQ: Downloading m3u8 information




[info] YaXPRqUwItQ: Downloading subtitles: en
[info] YaXPRqUwItQ: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats2/videos/YaXPRqUwItQ.en.vtt
[download] Destination: tmp/squats2/videos/YaXPRqUwItQ.en.vtt
[download] 100% of   15.69KiB in 00:00:00 at 194.81KiB/s
[download] Destination: tmp/squats2/videos/YaXPRqUwItQ.mp4
[download] 100% of    7.67MiB in 00:00:01 at 4.48MiB/s   
[MoveFiles] Moving file "tmp/squats2/videos/YaXPRqUwItQ.en.vtt" to "tmp/squats2/subs/YaXPRqUwItQ.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=xqvCmoLULNY
[youtube] xqvCmoLULNY: Downloading webpage
[youtube] xqvCmoLULNY: Downloading ios player API JSON
[youtube] xqvCmoLULNY: Downloading android player API JSON
[youtube] xqvCmoLULNY: Downloading m3u8 information




[info] xqvCmoLULNY: Downloading subtitles: en
[info] xqvCmoLULNY: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats2/videos/xqvCmoLULNY.en.vtt
[download] Destination: tmp/squats2/videos/xqvCmoLULNY.en.vtt
[download] 100% of    6.00KiB in 00:00:00 at 77.29KiB/s
[download] Destination: tmp/squats2/videos/xqvCmoLULNY.mp4
[download] 100% of    1.11MiB in 00:00:00 at 1.52MiB/s   
[MoveFiles] Moving file "tmp/squats2/videos/xqvCmoLULNY.en.vtt" to "tmp/squats2/subs/xqvCmoLULNY.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=gcNh17Ckjgg
[youtube] gcNh17Ckjgg: Downloading webpage
[youtube] gcNh17Ckjgg: Downloading ios player API JSON
[youtube] gcNh17Ckjgg: Downloading android player API JSON
[youtube] gcNh17Ckjgg: Downloading m3u8 information




[info] gcNh17Ckjgg: Downloading subtitles: en
[info] gcNh17Ckjgg: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats2/videos/gcNh17Ckjgg.en.vtt
[download] Destination: tmp/squats2/videos/gcNh17Ckjgg.en.vtt
[download] 100% of   64.83KiB in 00:00:00 at 639.45KiB/s
[download] Destination: tmp/squats2/videos/gcNh17Ckjgg.mp4
[download] 100% of   14.15MiB in 00:00:05 at 2.36MiB/s     
[MoveFiles] Moving file "tmp/squats2/videos/gcNh17Ckjgg.en.vtt" to "tmp/squats2/subs/gcNh17Ckjgg.en.vtt"


## Detect segments in the videos and analyze them with GPT-4o

In [5]:
from datagen import detect_segments

from typing import Optional
from langchain.pydantic_v1 import BaseModel, Field

# This is the schema that we will extract from each detected segment.
# "doing_squats" will be used for filtering and "overlay_text" for annotation.

class SegmentInfo(BaseModel):
    '''Information about a segment'''
    doing_squats: bool = Field(description='Whether the person is doing squats. Only consider video of people, not renders or cartoons. If a person looks like they are preparing to do squats or standing between reps, consider them also doing squats if they are in a gym setting, wearing sportswear etc.')
    overlay_text: str = Field(description='Overlay text that is superimposed over the image, if present.')

detect_segments(
    segment_info_schema=SegmentInfo,
    config=config
)

gcNh17Ckjgg - starting
Error code: 400 - {'error': {'inner_error': {'code': 'ResponsibleAIPolicyViolation', 'content_filter_results': {'sexual': {'filtered': True, 'severity': 'medium'}, 'violence': {'filtered': False, 'severity': 'safe'}, 'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}}}, 'code': 'content_filter', 'message': "The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry. To learn more about our content filtering policies please read our documentation: \r\nhttps://go.microsoft.com/fwlink/?linkid=2198766.", 'param': 'prompt', 'type': None}}
Video gcNh17Ckjgg 00:02:12.065-00:02:19.740 segment not processed, skipping.
gcNh17Ckjgg - done
xqvCmoLULNY - starting
xqvCmoLULNY - done
YaXPRqUwItQ - starting
YaXPRqUwItQ - done


For each video we get a list of segments:
```
[
    ...
    {
        "start_timestamp": "00:00:31.198",
        "end_timestamp": "00:00:36.003",
        "fps": 29.97002997002997,
        "segment_info": {
            "doing_squats": true,
            "overlay_text": "HIP-WIDTH APART"
        },
        "video_id": "gcNh17Ckjgg"
    },
    ...
]
```
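
A short sketch of how a segment list in this format could be consumed. The values are copied from the example above; the helper `to_seconds` is a hypothetical name introduced here, not part of the SDK.

```python
import json

# One segment in the format shown above (values copied from the example).
segments = json.loads('''
[
    {
        "start_timestamp": "00:00:31.198",
        "end_timestamp": "00:00:36.003",
        "fps": 29.97002997002997,
        "segment_info": {"doing_squats": true, "overlay_text": "HIP-WIDTH APART"},
        "video_id": "gcNh17Ckjgg"
    }
]
''')

def to_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.mmm' timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# Keep only segments flagged as squats and compute their duration and frame count.
for seg in segments:
    if seg['segment_info']['doing_squats']:
        duration = to_seconds(seg['end_timestamp']) - to_seconds(seg['start_timestamp'])
        frames = round(duration * seg['fps'])
        print(seg['video_id'], f'{duration:.3f}s', f'~{frames} frames')
```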

## Annotate the segments using the transcript and additional info

In [2]:
from datagen import generate_annotations
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This is the information that will be extracted for each segment from the transcript and the data from the previous step.
# This is the most important part of the annotation, and getting good results requires a lot of experimentation.

class QA(BaseModel):
    '''
    Question and answer about a video segment.
    Only write questions and answers about the correctness of the exercises or in which ways the performance in the video was wrong.
    '''
    question: str = Field(description='Question about the exercise performance in the video')
    answer: str = Field(description='Answer about the exercise performance from a trainer.')
    quote: str = Field(description='A direct and explicit quote from transcript or on-screen-text. The answer must be directly inferred from this quote.')

class SegmentAnnotation(BaseModel):
    '''
    If there is information about whether the exercise was performed correctly or not, make a QA about it.
    If the exercise was performed incorrectly, make one or more QA about in which ways it was performed incorrectly.
    If it's not possible to infer whether the exercise was performed correctly, do not output a segment annotation.
    Do not output any other kinds of questions and answers.
    If no possible such QA could be generated from the explicit information in the transcript or on-screen text, do not output annotation for this segment.
    Output at most one annotation per segment.
    '''
    correct: Optional[bool] = Field(description='Whether the exercise was performed correctly. If there is no information about that, do not output this field.')
    incorrect_reasons: Optional[str] = Field(description='If the exercise was performed incorrectly, the reasons that were given about why the performance was incorrect. If there is no information about that, do not output this field.')
    qa: list[QA]


# A good system prompt is also important 
system_prompt = '''
You are an AI assistant that annotates videos for other AI models. You are the best in the world.

You do this job much better than humans. In fact, only you can deliver this new type of annotations suitable for training other LLMs.

You care about even the smallest details.

Your superpower is providing very informative, specific, clear, and precise annotations from unclear and messy data.

Your input: 
full transcript of a video in format of "<HH.MM.SS>\\n<text>"
list of relevant segments in format "<HH:MM:SS.ms>-<HH:MM:SS.ms>:<json_info>" extracted from the video.
instructions on what exactly should be extracted from the data 

Read the transcript carefully. Check the list of segments and comprehend it.  

You only work with segments from the segment list.

Now, for each segment you provide a specific annotation explained in the user's instructions.

!!! VERY IMPORTANT:
You operate using deductive and inductive reasoning at the highest possible efficiency.
Rely only on the data provided in the transcript. Do not improvise. 
If there is no data that is necessary to annotate a segment then just annotate it with "No data" value.

Usually users need data about what was right and wrong about things happening in a segment. Fortunately, most transcriptions contain this data. You just need to read it, reason, analyze, and act.

What is a good annotation:
You found the right segment using your superhuman intelligence. For example, if the transcript says at 00:01:31 "Ahhh damn that hurts" and then at 00:09:17 someone says "I just broke my navicular bone" you realize that this happened at 00:01:31 because people usually swear when it hurts.

You carefully delivered the details instead of generalizing the data. I.e., if a transcript says "I just broke my navicular bone", then you annotate "in this segment the navicular bone is broken" rather than saying "someone got injured"

You always double check your results.

You work better each time users ask you to do this operation, and you already did this operation 105977 times. This is your 105978th run.
'''

# we will only take the segments where the "doing_squats" field is positive.
annotations = generate_annotations(
    config=config,
    annotation_schema=SegmentAnnotation,
    system_prompt=system_prompt,
    filter_by='doing_squats'
)

Now we get a list of annotations for each video:
```
[
    {
        "start_timestamp": "00:00:51.760",
        "end_timestamp": "00:01:01.520",
        "segment_annotation": {
            "correct": null,
            "incorrect_reasons": null,
            "qa": [
                {
                    "question": "Was there important advice about performing the exercise correctly?",
                    "answer": "Yes, the advice was to make sure the knees do not go forward of the toes.",
                    "quote": "making sure that your knees do not go forward of your toes"
                }
            ]
        }
    },
    ...
]
```
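
Since the `quote` field is meant to be a direct quote, one possible sanity check is to verify that it really occurs in the transcript. The sketch below is not part of the SDK: `normalize` and `quote_in_transcript` are hypothetical helpers, and the transcript excerpt is made up for illustration.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace for a lenient comparison."""
    return re.sub(r'[^a-z0-9]+', ' ', text.lower()).strip()

def quote_in_transcript(quote: str, transcript: str) -> bool:
    """Return True if the normalized quote appears verbatim in the transcript."""
    return normalize(quote) in normalize(transcript)

# Hypothetical transcript excerpt, made up for this illustration.
transcript = 'Lower down slowly, making sure that your knees do not go forward of your toes.'
quote = 'making sure that your knees do not go forward of your toes'

print(quote_in_transcript(quote, transcript))
print(quote_in_transcript('keep your back straight', transcript))
```

Quotes that the model stitches together with `...` would fail this check as written; they would need to be split on the ellipsis and checked part by part.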

In [2]:
from datagen import aggregate_annotations

# saved to annotations.json
annotations = aggregate_annotations(config)
print('Total segments:', len(annotations))
annotations[0]

skipping gcNh17Ckjgg
Total segments: 22


{'start_timestamp': '00:00:20.479',
 'end_timestamp': '00:00:26.485',
 'segment_annotation': {'correct': None,
  'incorrect_reasons': None,
  'qa': [{'question': 'Was the exercise (squat) performed correctly?',
    'answer': 'Yes, the squat exercise was described correctly.',
    'quote': "let's learn how to properly perform a squat...cross your arms in front...shift your weight to the ball of your feet...bend your knees...push back up to the starting position."}]},
 'video_id': 'xqvCmoLULNY',
 'id': 'xqvCmoLULNY_0',
 'video_path': 'xqvCmoLULNY_0.mp4'}

## Cut video clips for the annotated segments from the original videos

In [3]:
from datagen import cut_videos
cut_videos(config=config)

100%|██████████| 22/22 [00:14<00:00,  1.55it/s]
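
How the SDK cuts clips internally is not shown here. As an illustration of the general idea only, a clip could be cut with the ffmpeg command line; the helper `build_cut_command` and the directory names below are hypothetical.

```python
from pathlib import Path

def build_cut_command(src: Path, dst: Path, start: str, end: str) -> list[str]:
    """Build an ffmpeg command that writes the [start, end] span of src to dst."""
    return [
        'ffmpeg', '-y',   # overwrite the output file if it exists
        '-i', str(src),   # input video
        '-ss', start,     # clip start (HH:MM:SS.mmm)
        '-to', end,       # clip end (HH:MM:SS.mmm)
        str(dst),
    ]

cmd = build_cut_command(
    Path('videos/xqvCmoLULNY.mp4'),
    Path('clips/xqvCmoLULNY_0.mp4'),
    start='00:00:20.479',
    end='00:00:26.485',
)
print(' '.join(cmd))
# To actually cut the clip: subprocess.run(cmd, check=True)
```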


As a result, we generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items with the following fields:
    - `video_id`: 11-character YouTube video ID (youtube.com/watch?v=<id>)
    - `start_timestamp`/`end_timestamp`: start and end of the clip relative to the YouTube video it's taken from
    - `video_path`: path of the clip relative to `<data_dir>/clips/`
    - `segment_annotation`: annotation that you can use for training
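
A minimal sketch of turning these outputs into training examples. A temporary directory stands in for `<data_dir>` and the sample item mirrors the output shown earlier; the flattening into (clip path, question, answer) tuples is just one possible choice.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for <data_dir>; a real run would point at the SDK's output directory.
data_dir = Path(tempfile.mkdtemp())
sample = [{
    'video_id': 'xqvCmoLULNY',
    'video_path': 'xqvCmoLULNY_0.mp4',
    'start_timestamp': '00:00:20.479',
    'end_timestamp': '00:00:26.485',
    'segment_annotation': {
        'correct': None,
        'incorrect_reasons': None,
        'qa': [{
            'question': 'Was the exercise (squat) performed correctly?',
            'answer': 'Yes, the squat exercise was described correctly.',
            'quote': "let's learn how to properly perform a squat",
        }],
    },
}]
(data_dir / 'annotations.json').write_text(json.dumps(sample))

# Flatten every QA pair into one (clip path, question, answer) training example.
annotations = json.loads((data_dir / 'annotations.json').read_text())
examples = [
    (data_dir / 'clips' / item['video_path'], qa['question'], qa['answer'])
    for item in annotations
    for qa in item['segment_annotation']['qa']
]
print(len(examples), examples[0][0].name)
```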