# Getting Started with Data Generation SDK

We are going to generate a dataset of squat videos with instructions on how to perform them, so that we can train an AI personal trainer.

In [1]:
%load_ext autoreload
%autoreload 2

from datagen import DatagenConfig

config_params = {
    'openai': {
        'type': 'azure', # openai/azure
        'temperature': '1',
        'deployment': 'gpt4o' # model for openai / deployment for azure
    },
    'data_dir': './tmp/squats'
}

!mkdir -p {config_params['data_dir']}

# This config handles all the bookkeeping, so you need to pass it to every SDK function.
config = DatagenConfig(**config_params)
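
The `!mkdir -p` line above is an IPython shell escape, so it only works inside a notebook. Outside of one, a minimal standard-library sketch could look like this; `ensure_data_dir` is a hypothetical helper for illustration, not part of the SDK.

```python
from pathlib import Path

def ensure_data_dir(path: str) -> Path:
    """Create the directory (and any missing parents) if it does not exist yet."""
    data_dir = Path(path)
    data_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    return data_dir
```

Calling `ensure_data_dir(config_params['data_dir'])` before building the config would replace the shell command.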

## Get a list of search queries to search for videos

In [2]:
from datagen import get_queries
queries = get_queries(
    config=config,
    prompt='I want to find instructional videos about how to do squats.',
    num_queries=2
)
queries

['how to do squats', 'squat exercise tutorial']

## Download video information for each query.

We'll get 2 videos for each query.<br>
The same video can be returned for several queries, so we might get fewer than `num_queries * videos_per_query` videos.<br>
If you want to get all YouTube videos for a query, omit the `videos_per_query` parameter.
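
To see why the total can fall below the product of the two parameters, here is a minimal sketch of merging per-query results while dropping duplicates. The result lists are made up for illustration, and `merge_unique` is a hypothetical helper, not the SDK's implementation.

```python
def merge_unique(results_per_query: list[list[str]]) -> list[str]:
    """Flatten per-query ID lists, keeping only the first occurrence of each ID."""
    seen: set[str] = set()
    merged: list[str] = []
    for ids in results_per_query:
        for video_id in ids:
            if video_id not in seen:
                seen.add(video_id)
                merged.append(video_id)
    return merged

# Hypothetical search results: the second query returns one video already found by the first.
results = [["aaaaaaaaaaa", "bbbbbbbbbbb"], ["bbbbbbbbbbb", "ccccccccccc"]]
unique_ids = merge_unique(results)
print(len(unique_ids))  # 3 rather than 2 * 2 = 4
```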

You can limit the search to videos licensed under Creative Commons (as indicated by YouTube).<br>
Because the search libraries don't support this filter directly yet, we search for all videos and filter by license afterwards.<br>
Unfortunately, this usually leaves very few results, so use it with caution.
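
The search-then-filter approach can be sketched as follows. The metadata dicts and the exact wording of the `license` values are assumptions for illustration, and `keep_creative_commons` is a hypothetical helper rather than the SDK's code.

```python
def keep_creative_commons(videos: list[dict]) -> list[dict]:
    """Keep only videos whose license string mentions Creative Commons."""
    return [
        v for v in videos
        if "creative commons" in (v.get("license") or "").lower()
    ]

# Hypothetical metadata for three search results.
found = [
    {"id": "aaaaaaaaaaa", "license": "Creative Commons Attribution license (reuse allowed)"},
    {"id": "bbbbbbbbbbb", "license": None},
    {"id": "ccccccccccc", "license": "Standard YouTube License"},
]
cc_only = keep_creative_commons(found)
print([v["id"] for v in cc_only])  # ['aaaaaaaaaaa']
```

Because the filter runs after the search, a small `videos_per_query` can easily leave nothing once non-CC videos are dropped.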

In [3]:
from datagen import get_video_ids
ids = get_video_ids(queries, config=config, videos_per_query=2, only_creative_commons=False)
ids

  0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 2/2 [00:01<00:00,  1.50it/s]


['EbOPpWi4L8s', 'gcNh17Ckjgg', 'xqvCmoLULNY', 'IB_icWRzi4E']

In [2]:
import json
with open(config.data_dir / 'video_ids.json') as f:
    ids = json.load(f)

## Download videos and autogenerated subtitles

You can change subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (refer to https://github.com/yt-dlp/yt-dlp).<br>
The SDK expects `.mp4` video files (for now), so don't change the video format.
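
As an illustration, an options dictionary for English auto-generated subtitles might look like the sketch below. The keys are standard yt-dlp option names; passing the dictionary to `download_videos` as a `yt_dlp_opts` keyword argument is an assumption based on the description above, so check the SDK signature first. The proxy URL is a placeholder.

```python
# Standard yt-dlp option names; the values are illustrative.
yt_dlp_opts = {
    "writeautomaticsub": True,   # download auto-generated subtitles
    "subtitleslangs": ["en"],    # subtitle languages to fetch
    "subtitlesformat": "vtt",    # subtitle file format
    # "proxy": "http://proxy.example.com:8080",  # placeholder; may help with bot checks
}

# Assumed call shape, not verified against the SDK:
# download_videos(ids, config, yt_dlp_opts=yt_dlp_opts)
```

The video `format` option is deliberately left out so that the SDK's `.mp4` default stays in effect.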

In [3]:
from datagen import download_videos
# Use a proxy if you get "Sign in to confirm you’re not a bot."
download_videos(ids, config)

[youtube] Extracting URL: https://www.youtube.com/watch?v=gcNh17Ckjgg
[youtube] gcNh17Ckjgg: Downloading webpage


[youtube] gcNh17Ckjgg: Downloading ios player API JSON
[youtube] gcNh17Ckjgg: Downloading web creator player API JSON


ERROR: [youtube] gcNh17Ckjgg: Sign in to confirm you’re not a bot. This helps protect our community. Learn more


2024-08-09 11:23:14.371408 Error at video gcNh17Ckjgg, skipping
2024-08-09 11:23:14.371447 ERROR: [youtube] gcNh17Ckjgg: Sign in to confirm you’re not a bot. This helps protect our community. Learn more
[youtube] Extracting URL: https://www.youtube.com/watch?v=IB_icWRzi4E
[youtube] IB_icWRzi4E: Downloading webpage
[youtube] IB_icWRzi4E: Downloading ios player API JSON
[youtube] IB_icWRzi4E: Downloading web creator player API JSON


ERROR: [youtube] IB_icWRzi4E: Sign in to confirm you’re not a bot. This helps protect our community. Learn more


2024-08-09 11:23:14.974700 Error at video IB_icWRzi4E, skipping
2024-08-09 11:23:14.974742 ERROR: [youtube] IB_icWRzi4E: Sign in to confirm you’re not a bot. This helps protect our community. Learn more
[youtube] Extracting URL: https://www.youtube.com/watch?v=xqvCmoLULNY
[youtube] xqvCmoLULNY: Downloading webpage
[youtube] xqvCmoLULNY: Downloading ios player API JSON
[youtube] xqvCmoLULNY: Downloading web creator player API JSON


ERROR: [youtube] xqvCmoLULNY: Sign in to confirm you’re not a bot. This helps protect our community. Learn more


2024-08-09 11:23:15.656008 Error at video xqvCmoLULNY, skipping
2024-08-09 11:23:15.656044 ERROR: [youtube] xqvCmoLULNY: Sign in to confirm you’re not a bot. This helps protect our community. Learn more
[youtube] Extracting URL: https://www.youtube.com/watch?v=EbOPpWi4L8s
[youtube] EbOPpWi4L8s: Downloading webpage
[youtube] EbOPpWi4L8s: Downloading ios player API JSON
[youtube] EbOPpWi4L8s: Downloading web creator player API JSON


ERROR: [youtube] EbOPpWi4L8s: Sign in to confirm you’re not a bot. This helps protect our community. Learn more


2024-08-09 11:23:16.219252 Error at video EbOPpWi4L8s, skipping
2024-08-09 11:23:16.219294 ERROR: [youtube] EbOPpWi4L8s: Sign in to confirm you’re not a bot. This helps protect our community. Learn more


## Detect segments from video

We will use the CLIP-based version (`detect_segments_clip`, here with a SigLIP model) because it's much faster than gpt4o, but it needs a GPU.
You can also run it on a CPU for debugging.

In [4]:
from transformers import AutoProcessor, AutoModel

# Set device = 'cpu' to run without a GPU.
# SigLIP outputs independent probabilities per prompt, whereas CLIP outputs multiclass probabilities.
device = 'cuda' # or 'cuda:0'
model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384").to(device)
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
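
The difference between the two scoring schemes can be sketched in plain Python. The logits below are made-up numbers, not model output: a softmax (CLIP-style) makes the prompts compete so the scores sum to 1, while a sigmoid (SigLIP-style) scores each prompt independently.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Multiclass probabilities: prompts compete and the results sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits: list[float]) -> list[float]:
    """Independent probabilities: each prompt is scored on its own."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Made-up image-text logits for two prompts that both match the frame poorly.
logits = [-4.0, -5.0]
print(softmax(logits))  # the first prompt still gets most of the probability mass
print(sigmoid(logits))  # both probabilities stay close to 0
```

With a single prompt such as `'a person doing squats'`, a softmax would always return 1, which is why independent probabilities are convenient for thresholding individual frames.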

In [6]:
from datagen import detect_segments_clip

from typing import Optional
from langchain.pydantic_v1 import BaseModel, Field

detect_segments_clip(
    # video_ids=['KvRK5Owqzgw'],
    text_prompts='a person doing squats', # the text that the model compares to video frames. You can pass a list of texts to use the average distance.
    model=model,
    processor=processor,
    fps_sampling=2, # higher values give finer segment borders and more precise segments, at the cost of speed.
    device='cuda', # 'cpu' for local debugging
    frames_per_batch=100, # 100 frames use about 10GB of GPU RAM, so pick a batch size that fits your GPU.
    config=config,
)

  0%|          | 0/4 [00:00<?, ?it/s]

2024-08-09 11:29:42.322734 grabbing video EbOPpWi4L8s: 193 frames


running clip on batch [(0, 100, 'EbOPpWi4L8s')]...
2024-08-09 11:30:03.028306 grabbing video xqvCmoLULNY: 97 frames
running clip on batch [(0, 93, 'EbOPpWi4L8s'), (93, 100, 'xqvCmoLULNY')]...


 25%|██▌       | 1/4 [00:39<01:58, 39.37s/it]

video EbOPpWi4L8s completed - 4 segments detected
[Segment(start_timestamp='00:00:04.250', end_timestamp='00:00:13.250', fps=23.976023976023978, segment_info=None, video_id='EbOPpWi4L8s'), Segment(start_timestamp='00:00:35.750', end_timestamp='00:00:52.750', fps=23.976023976023978, segment_info=None, video_id='EbOPpWi4L8s'), Segment(start_timestamp='00:01:01.250', end_timestamp='00:01:06.750', fps=23.976023976023978, segment_info=None, video_id='EbOPpWi4L8s'), Segment(start_timestamp='00:01:08.250', end_timestamp='00:01:17.250', fps=23.976023976023978, segment_info=None, video_id='EbOPpWi4L8s')]
2024-08-09 11:30:22.492963 grabbing video gcNh17Ckjgg: 870 frames
running clip on batch [(0, 90, 'xqvCmoLULNY'), (90, 100, 'gcNh17Ckjgg')]...


 75%|███████▌  | 3/4 [00:59<00:17, 17.48s/it]

video xqvCmoLULNY completed - 2 segments detected
[Segment(start_timestamp='00:00:19.750', end_timestamp='00:00:27.250', fps=23.976023976023978, segment_info=None, video_id='xqvCmoLULNY'), Segment(start_timestamp='00:00:28.750', end_timestamp='00:00:41.750', fps=23.976023976023978, segment_info=None, video_id='xqvCmoLULNY')]
running clip on batch [(0, 100, 'gcNh17Ckjgg')]...
running clip on batch [(0, 100, 'gcNh17Ckjgg')]...
running clip on batch [(0, 100, 'gcNh17Ckjgg')]...
running clip on batch [(0, 100, 'gcNh17Ckjgg')]...
running clip on batch [(0, 100, 'gcNh17Ckjgg')]...
running clip on batch [(0, 100, 'gcNh17Ckjgg')]...
running clip on batch [(0, 100, 'gcNh17Ckjgg')]...
running clip on batch [(0, 100, 'gcNh17Ckjgg')]...
2024-08-09 11:33:23.247812 grabbing video bEv6CCg2BC8: 1318 frames
running clip on batch [(0, 60, 'gcNh17Ckjgg'), (60, 100, 'bEv6CCg2BC8')]...


6it [04:00, 43.86s/it]                       

video gcNh17Ckjgg completed - 14 segments detected
[Segment(start_timestamp='00:00:00.750', end_timestamp='00:00:33.250', fps=29.97002997002997, segment_info=None, video_id='gcNh17Ckjgg'), Segment(start_timestamp='00:00:40.750', end_timestamp='00:00:52.250', fps=29.97002997002997, segment_info=None, video_id='gcNh17Ckjgg'), Segment(start_timestamp='00:01:41.750', end_timestamp='00:01:48.750', fps=29.97002997002997, segment_info=None, video_id='gcNh17Ckjgg'), Segment(start_timestamp='00:01:56.750', end_timestamp='00:02:04.250', fps=29.97002997002997, segment_info=None, video_id='gcNh17Ckjgg'), Segment(start_timestamp='00:02:16.250', end_timestamp='00:02:20.250', fps=29.97002997002997, segment_info=None, video_id='gcNh17Ckjgg'), Segment(start_timestamp='00:02:27.750', end_timestamp='00:02:38.250', fps=29.97002997002997, segment_info=None, video_id='gcNh17Ckjgg'), Segment(start_timestamp='00:02:40.250', end_timestamp='00:02:54.750', fps=29.97002997002997, segment_info=None, video_id='gcNh

10it [08:15, 49.54s/it]

video bEv6CCg2BC8 completed - 13 segments detected
[Segment(start_timestamp='00:00:10.250', end_timestamp='00:00:30.750', fps=24.0, segment_info=None, video_id='bEv6CCg2BC8'), Segment(start_timestamp='00:00:58.250', end_timestamp='00:01:15.250', fps=24.0, segment_info=None, video_id='bEv6CCg2BC8'), Segment(start_timestamp='00:01:25.250', end_timestamp='00:01:55.750', fps=24.0, segment_info=None, video_id='bEv6CCg2BC8'), Segment(start_timestamp='00:02:03.750', end_timestamp='00:02:24.750', fps=24.0, segment_info=None, video_id='bEv6CCg2BC8'), Segment(start_timestamp='00:02:29.750', end_timestamp='00:02:43.250', fps=24.0, segment_info=None, video_id='bEv6CCg2BC8'), Segment(start_timestamp='00:03:03.750', end_timestamp='00:03:43.750', fps=24.0, segment_info=None, video_id='bEv6CCg2BC8'), Segment(start_timestamp='00:03:54.750', end_timestamp='00:04:46.750', fps=24.0, segment_info=None, video_id='bEv6CCg2BC8'), Segment(start_timestamp='00:04:53.750', end_timestamp='00:06:08.750', fps=24.0, 
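
The batch descriptions in the log, such as `[(0, 93, 'EbOPpWi4L8s'), (93, 100, 'xqvCmoLULNY')]`, suggest that frames from consecutive videos are packed into batches of `frames_per_batch` frames. Below is a minimal sketch of such packing, inferred from the log format only and not taken from the SDK source; `pack_batches` is a hypothetical helper.

```python
def pack_batches(frame_counts, frames_per_batch):
    """Yield batches as lists of (start, end, video_id) slices within the batch."""
    batch, used = [], 0
    for video_id, n_frames in frame_counts:
        remaining = n_frames
        while remaining > 0:
            # Take as many frames as still fit into the current batch.
            take = min(remaining, frames_per_batch - used)
            batch.append((used, used + take, video_id))
            used += take
            remaining -= take
            if used == frames_per_batch:
                yield batch
                batch, used = [], 0
    if batch:
        yield batch  # final, partially filled batch

# Frame counts taken from the log above.
batches = list(pack_batches([("EbOPpWi4L8s", 193), ("xqvCmoLULNY", 97)], 100))
for b in batches:
    print(b)
```

The first two batches reproduce the first two `running clip on batch` lines of the log.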




For each video we get a list of segments:
```
[
    ...
    {
        "start_timestamp": "00:00:32.500",
        "end_timestamp": "00:00:41.500",
        "fps": 29.97002997002997,
        "segment_info": null, # not used with clip, but could be used with gpt4o
        "video_id": "KvRK5Owqzgw"
    },
    ...
]
```
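
One way such segments could be derived from per-frame scores is to threshold the scores and merge runs of consecutive positive frames. This is an illustrative sketch, not the SDK's algorithm: the threshold, the scores, and both helper names are assumptions.

```python
def to_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, the format used in the segment list."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def scores_to_segments(scores, fps_sampling, threshold=0.5):
    """Merge runs of consecutive frames scoring above the threshold into (start, end) pairs."""
    segments, run_start = [], None
    for i, score in enumerate(list(scores) + [0.0]):  # the sentinel closes a trailing run
        if score > threshold and run_start is None:
            run_start = i
        elif score <= threshold and run_start is not None:
            segments.append((to_timestamp(run_start / fps_sampling),
                             to_timestamp(i / fps_sampling)))
            run_start = None
    return segments

# Made-up scores sampled at 2 frames per second.
scores = [0.1, 0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.9]
segments = scores_to_segments(scores, fps_sampling=2)
print(segments)
# [('00:00:00.500', '00:00:02.000'), ('00:00:03.000', '00:00:04.000')]
```

At 2 sampled frames per second, segment borders can only fall on 0.5 s steps, which is the granularity trade-off controlled by `fps_sampling`.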

## Annotation step 1: extract information (clues) from transcript

In [8]:
from datagen import generate_clues

human_prompt = """
The provided video is a tutorial about how to perform squats. 

I need to understand HOW THE PERSON SHOWN IN EACH SEGMENT PERFORMS SQUATS IN THIS SEGMENT.
What is done correctly.
What mistakes they make. Why these mistakes happen.
How these mistakes could be improved.

It is very important that the information you provide describes how the person shown in the segment is doing squats, and is not generic advice unrelated to the visual information.
"""

from time import sleep
clues = generate_clues(
    # video_ids=['byxWus7BwfQ'],
    config=config,
    human_prompt=human_prompt,
    segments_per_call=5, # the output can be quite long, so limit the number of segments per GPT call to respect the max output length
    raise_on_error=True, # interrupt when encountering an error. Useful for debugging.
)



2024-08-09 11:40:24.266342 bEv6CCg2BC8 - started
2024-08-09 11:40:24.266955 bEv6CCg2BC8 part 0 - started


2024-08-09 11:40:46.369636 bEv6CCg2BC8 part 1 - started
2024-08-09 11:41:27.258525 bEv6CCg2BC8 part 2 - started




2024-08-09 11:41:29.519412 bEv6CCg2BC8 - done
2024-08-09 11:41:29.521122 gcNh17Ckjgg - started
2024-08-09 11:41:29.522229 gcNh17Ckjgg part 0 - started
2024-08-09 11:41:47.972894 gcNh17Ckjgg part 1 - started
2024-08-09 11:42:19.042026 gcNh17Ckjgg part 2 - started




2024-08-09 11:42:48.376612 gcNh17Ckjgg - done
2024-08-09 11:42:48.378026 xqvCmoLULNY - started
2024-08-09 11:42:48.379028 xqvCmoLULNY part 0 - started




2024-08-09 11:43:08.990347 xqvCmoLULNY - done
2024-08-09 11:43:08.992046 EbOPpWi4L8s - started
2024-08-09 11:43:08.993245 EbOPpWi4L8s part 0 - started


100%|██████████| 4/4 [03:04<00:00, 46.05s/it]

2024-08-09 11:43:28.475395 EbOPpWi4L8s - done
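
The `part 0`, `part 1`, `part 2` lines in the log are consistent with each video's segments being sent to the model in chunks of `segments_per_call`. Here is a minimal sketch of that chunking; `split_into_parts` is a hypothetical helper, not the SDK's code.

```python
def split_into_parts(segments: list, segments_per_call: int) -> list[list]:
    """Split the segment list into consecutive chunks of at most segments_per_call items."""
    return [
        segments[i:i + segments_per_call]
        for i in range(0, len(segments), segments_per_call)
    ]

# A video with 13 detected segments (like bEv6CCg2BC8) and segments_per_call=5.
parts = split_into_parts(list(range(13)), segments_per_call=5)
print([len(p) for p in parts])  # [5, 5, 3]
```

A smaller `segments_per_call` means more calls per video but shorter outputs per call.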





## Annotation step 2: extract structured annotations from the transcript and clues

In [9]:
from datagen import generate_annotations
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This is the information that will be extracted for each segment from the transcript and the data from the previous step.
# It is the most important part of the annotation, and getting good results requires a lot of experimenting.


human_prompt = '''
You are given a JSON object that contains clues about segments of a video with timecodes.
!!!! For each segment provided in the JSON object you need to answer the following questions:
1. Given the data found in the JSON object, what is the probability that this part contains footage of a person doing squats? [the answer can only be "high", "medium", "low", or null (if impossible to infer from the provided data)]
2. Given the data found in the JSON object, and even if the answer to the previous question is "low", does this person do squats right, wrong, or mixed? [the answer can only be "right", "wrong", "mixed", or null (if impossible to infer from the provided data)]
3. Given the data found in the JSON object, what exactly does this person do right and/or wrong regarding their squat technique? [the answer should be clear and focused on body parts]
4. If the answer to the previous question contains a description of wrong technique, explain how to fix these mistakes using your "own knowledge" as if you were a sports coach.
'''

class SegmentFeedback(BaseModel):
    '''
—> GOOD EXAMPLES:
    "wrong":"Knees caving in: This can stress the knees and reduce effectiveness"
    "correction":"Focus on keeping knees aligned with your toes."
    "wrong":"Rounding the back: This increases the risk of back injuries"
    "correction":"Keep your chest up and maintain a neutral spine throughout the movement."
    "wrong":"Heels are lifting off the ground: this shifts the weight forward, reducing stability"
    "correction":" Keep your weight on your heels and press through them as you rise."
    "right":"Chest and shoulders: The chest is up, and the shoulders are back, maintaining an upright torso."
    "correction":null
—> BAD EXAMPLES:
    "wrong":"knees"
    "correction":"fix knees"
    "wrong":"back looks funny"
    "correction":"make back better"
    "wrong":"feet are doing something"
    "correction":"feet should be different"
    "right":"arms"
    "correction":"arms are fine i think"
—> BAD EXAMPLES END HERE
    '''
    right: Optional[str] = Field(description='what was right in the performance')
    wrong: Optional[str] = Field(description='what was wrong in the performance')
    correction: Optional[str] = Field(description='how and in what ways the performance could be improved')

# The segment timestamps are taken from the provided information.
class SegmentAnnotation(BaseModel):
    squats_probability: Optional[str] = Field(description='how high is the probability that the person is doing squats in the segment: low, medium, high, unknown(null)')
    squats_technique_correctness: Optional[str] = Field(description='correctness of the squat technique.')
    squats_feedback: Optional[SegmentFeedback] = Field(description='what was right and wrong in the squat performance in the segment. When the technique is incorrect, provide instructions on how to correct it.')

# We will filter the segments in the next step, keeping only those where `squats_probability` is not low.
annotations = generate_annotations(
    human_prompt=human_prompt,
    config=config,
    segments_per_call=5,
    annotation_schema=SegmentAnnotation,
)



2024-08-09 11:43:36.608390 bEv6CCg2BC8 - started




2024-08-09 11:43:51.283055 bEv6CCg2BC8 - done
2024-08-09 11:43:51.284728 gcNh17Ckjgg - started




2024-08-09 11:44:03.635522 gcNh17Ckjgg - done
2024-08-09 11:44:03.637116 xqvCmoLULNY - started




2024-08-09 11:44:07.351109 xqvCmoLULNY - done
2024-08-09 11:44:07.352673 EbOPpWi4L8s - started


100%|██████████| 4/4 [00:40<00:00, 10.16s/it]

2024-08-09 11:44:17.226695 EbOPpWi4L8s - done





In [10]:
from datagen import aggregate_annotations

def filter_annotations(ann):
    if ann['squats_probability'] in [None, 'low', 'None', 'null']:
        # if we're not able to infer probability or prob is low, we don't need it
        return False
    if ann['squats_technique_correctness'] in [None, 'null', 'None']:
        # if we couldn't establish correctness at all, the feedback is probably useless
        return False
    if ann['squats_technique_correctness'] in ['mixed']:
        # if correctness is mixed and there is no feedback, discard the segment: it has nothing to use for training
        if ann['squats_feedback'] is None:
            return False
        if set(ann['squats_feedback'].values()) == set([None]):
            return False
    return True

annotations = aggregate_annotations(config, filter_func=filter_annotations, annotation_file='annotations.json')
print('Total segments:', len(annotations))
annotations[0]

100%|██████████| 4/4 [00:00<00:00, 7225.33it/s]

Total segments: 10





{'start_timestamp': '00:03:31.750',
 'end_timestamp': '00:03:54.250',
 'segment_annotation': {'squats_probability': 'high',
  'squats_technique_correctness': 'right',
  'squats_feedback': {'right': 'Controlled descent, proper knee and butt positioning, upper back tightness, deep and comfortable squat.',
   'wrong': None,
   'correction': None}},
 'video_id': 'gcNh17Ckjgg',
 'id': 'gcNh17Ckjgg_0',
 'video_path': 'gcNh17Ckjgg_0.mp4'}
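
To check which annotations survive the filter, it can help to run it on a few hand-written cases. The sketch below uses a condensed copy of the rules from `filter_annotations` so that it runs on its own; the sample annotations are made up for illustration.

```python
MISSING = (None, 'None', 'null')

def filter_annotations(ann):
    """Condensed copy of the filter rules from the notebook cell."""
    if ann['squats_probability'] in MISSING + ('low',):
        return False
    if ann['squats_technique_correctness'] in MISSING:
        return False
    if ann['squats_technique_correctness'] == 'mixed':
        feedback = ann['squats_feedback']
        if feedback is None or set(feedback.values()) == {None}:
            return False
    return True

empty_feedback = {'right': None, 'wrong': None, 'correction': None}
good_feedback = {'right': 'Chest is up.', 'wrong': None, 'correction': None}

# Made-up annotations: (probability, correctness, feedback)
cases = [
    ('high', 'right', good_feedback),    # kept
    ('low', 'right', good_feedback),     # dropped: low probability
    ('medium', 'null', good_feedback),   # dropped: correctness unknown
    ('high', 'mixed', empty_feedback),   # dropped: mixed with no feedback
]
kept = [
    filter_annotations({'squats_probability': p,
                        'squats_technique_correctness': c,
                        'squats_feedback': f})
    for p, c, f in cases
]
print(kept)  # [True, False, False, False]
```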

## The last step is to cut video clips for annotated segments from original videos

In [11]:
from datagen import cut_videos
cut_videos(config=config)



100%|██████████| 10/10 [00:28<00:00,  2.86s/it]
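
For intuition, cutting one annotated segment by hand could look like the sketch below, which only builds an `ffmpeg` command line. This is an assumption about one possible approach, not a description of what `cut_videos` does internally; `build_cut_command` and the source file name are hypothetical.

```python
def build_cut_command(src: str, dst: str, start: str, end: str) -> list[str]:
    """Build an ffmpeg command that writes the [start, end] range of src into dst."""
    # -ss / -to before -i limit the part of the input that is read; -y overwrites dst.
    return ["ffmpeg", "-y", "-ss", start, "-to", end, "-i", src, dst]

# Timestamps and clip name taken from the annotation shown earlier; the source name is assumed.
cmd = build_cut_command("gcNh17Ckjgg.mp4", "gcNh17Ckjgg_0.mp4",
                        "00:03:31.750", "00:03:54.250")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```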


As a result, we generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items with the following fields:
    - `video_id`: 11-character YouTube video ID (youtube.com/watch?v=<id>)
    - `start_timestamp`/`end_timestamp` of the clip relative to the YouTube video it's taken from
    - `video_path` of the clip relative to `<data_dir>/clips/`
    - `segment_annotation` that you can use for training
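
A minimal sketch of reading these files back into training samples, assuming `annotations.json` is a JSON list of items with the fields listed above. `load_samples` and `timestamp_to_seconds` are hypothetical helpers, and the demo writes a one-item file to a temporary directory instead of touching a real `data_dir`.

```python
import json
import tempfile
from pathlib import Path

def timestamp_to_seconds(ts: str) -> float:
    """Convert HH:MM:SS.mmm to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def load_samples(data_dir: Path) -> list[dict]:
    """Pair each clip path with its annotation and clip duration."""
    with open(data_dir / "annotations.json") as f:
        items = json.load(f)
    return [
        {
            "clip": data_dir / "clips" / item["video_path"],
            "duration_s": timestamp_to_seconds(item["end_timestamp"])
                          - timestamp_to_seconds(item["start_timestamp"]),
            "target": item["segment_annotation"],
        }
        for item in items
    ]

# Demo with one item shaped like the annotation shown earlier.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "annotations.json").write_text(json.dumps([{
    "start_timestamp": "00:03:31.750",
    "end_timestamp": "00:03:54.250",
    "segment_annotation": {"squats_probability": "high"},
    "video_id": "gcNh17Ckjgg",
    "id": "gcNh17Ckjgg_0",
    "video_path": "gcNh17Ckjgg_0.mp4",
}]))
samples = load_samples(demo_dir)
print(samples[0]["clip"].name, samples[0]["duration_s"])  # gcNh17Ckjgg_0.mp4 22.5
```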