# Getting Started with Data Generation SDK

We are going to generate a dataset of squat videos, each paired with instructions on how to perform the exercise, so that we can train an AI personal trainer.

In [1]:
%load_ext autoreload
%autoreload 2

from datagen import DatagenConfig
# This config handles all the bookkeeping, so you need to pass it to every SDK call.
config = DatagenConfig.from_yaml('./config.yaml')

## Get a list of search queries to search for videos

In [4]:
from datagen import get_queries
queries = get_queries(
    config=config,
    prompt='I want to find instructional videos about how to do squats.',
    num_queries=5
)
queries

['how to do squats',
 'squat exercise tutorial',
 'beginner guide to squats',
 'proper squat form',
 'squat workout video']

## Download video information for each query.

We'll get up to 5 videos for each query.<br>
One video might be found by multiple queries, so we might get fewer than `n_queries*videos_per_query` videos.<br>
If you want to get all YouTube videos for a query, don't pass the `videos_per_query` parameter.

You can limit the search to videos licensed under Creative Commons (as indicated by YouTube).<br>
As this filter isn't directly implemented in the search libraries yet, we search for all videos and filter by license afterwards.<br>
Unfortunately, this way you will likely get very few results, so use it with caution.

In [5]:
from datagen import get_video_ids
ids = get_video_ids(queries, config=config, videos_per_query=5, only_creative_commons=False)
ids

  0%|          | 0/5 [00:00<?, ?it/s]

100%|██████████| 5/5 [00:03<00:00,  1.26it/s]


['KJ8xAMJdZjQ',
 'ubdIGnX2Hfs',
 'YaXPRqUwItQ',
 'l83R5PblSMA',
 'irfw1gQ0foQ',
 'dCHLUtf--pg',
 'PPmvh7gBTi0',
 'EbOPpWi4L8s',
 '4KmY44Xsg2w',
 '3qkgrJNB6kA',
 'IB_icWRzi4E',
 'HFnSsLIB7a4',
 'xqvCmoLULNY',
 'LSj280OEKUI',
 'gcNh17Ckjgg',
 'p-R0HSfL6nw',
 'DGhHgiCfAb0',
 '_uZLFUnKSaM',
 'byxWus7BwfQ',
 'xuf1czJv-XI']
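
In the run above, 5 queries with `videos_per_query=5` returned 20 IDs rather than 25, which is consistent with some videos being found by more than one query. A minimal sketch of that merging step, assuming each query yields a plain list of IDs; `merge_unique_ids` is a hypothetical helper for illustration, not part of the SDK:

```python
from typing import Iterable, List


def merge_unique_ids(results_per_query: Iterable[List[str]]) -> List[str]:
    """Flatten per-query results, keeping the first occurrence of each ID."""
    seen = set()
    merged = []
    for ids_for_query in results_per_query:
        for video_id in ids_for_query:
            if video_id not in seen:
                seen.add(video_id)
                merged.append(video_id)
    return merged


# Two queries that both return 'ubdIGnX2Hfs': 4 results, 3 unique IDs.
results = [['KJ8xAMJdZjQ', 'ubdIGnX2Hfs'], ['ubdIGnX2Hfs', 'YaXPRqUwItQ']]
print(merge_unique_ids(results))
# ['KJ8xAMJdZjQ', 'ubdIGnX2Hfs', 'YaXPRqUwItQ']
```

The SDK may order or store the IDs differently; the point is only that duplicates across queries collapse into one entry.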

## Download videos and autogenerated subtitles

You can change subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (refer to https://github.com/yt-dlp/yt-dlp).<br>
The SDK expects `.mp4` video files (for now), so don't change that.
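
As a rough illustration of what such a dictionary can look like: the keys below are standard options of yt-dlp's `YoutubeDL` class. How the SDK merges them with its own defaults, and the exact argument through which `download_videos` accepts them, are assumptions to verify against the SDK source.

```python
# Options understood by yt-dlp's YoutubeDL class.
# The video format is deliberately left untouched, since the SDK expects .mp4.
yt_dlp_opts = {
    "writeautomaticsub": True,       # download auto-generated subtitles
    "subtitleslangs": ["en", "es"],  # subtitle languages to fetch
    "subtitlesformat": "vtt",        # subtitle file format
}

# Assumed call shape, not confirmed by this notebook:
# download_videos(ids, config, yt_dlp_opts=yt_dlp_opts)
print(sorted(yt_dlp_opts))
```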

In [6]:
from datagen import download_videos
download_videos(ids, config)

[youtube] Extracting URL: https://www.youtube.com/watch?v=PPmvh7gBTi0
[youtube] PPmvh7gBTi0: Downloading webpage


[youtube] PPmvh7gBTi0: Downloading ios player API JSON
[youtube] PPmvh7gBTi0: Downloading tv player API JSON
[youtube] PPmvh7gBTi0: Downloading player d2e656ee
[youtube] PPmvh7gBTi0: Downloading m3u8 information
[info] PPmvh7gBTi0: Downloading subtitles: en
[info] PPmvh7gBTi0: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/PPmvh7gBTi0.en.vtt
[download] Destination: tmp/squats/videos/PPmvh7gBTi0.en.vtt
[download] 100% of   10.23KiB in 00:00:00 at 265.22KiB/s
[download] Destination: tmp/squats/videos/PPmvh7gBTi0.mp4
[download] 100% of    3.71MiB in 00:00:00 at 12.52MiB/s  
[MoveFiles] Moving file "tmp/squats/videos/PPmvh7gBTi0.en.vtt" to "tmp/squats/subs/PPmvh7gBTi0.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=KJ8xAMJdZjQ
[youtube] KJ8xAMJdZjQ: Downloading webpage
[youtube] KJ8xAMJdZjQ: Downloading ios player API JSON
[youtube] KJ8xAMJdZjQ: Downloading tv player API JSON
[youtube] KJ8xAMJdZjQ: Downloading m3u8 information
[info] KJ8x

## Detect segments from video

We will use the CLIP-based version (here with a SigLIP checkpoint) because it's much faster than using GPT-4o, but it needs a GPU.<br>
You can also try running it on a CPU for debugging.

In [2]:
from transformers import AutoProcessor, AutoModel

# remove .cuda() for cpu
model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384").cuda()
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

In [9]:
from datagen import detect_segments_clip

from typing import Optional
from langchain.pydantic_v1 import BaseModel, Field

detect_segments_clip(
    # video_ids=['KvRK5Owqzgw'],
    text_prompts='a person doing squats', # the text that CLIP compares to each frame. You can pass a list of texts to use the average distance.
    model=model,
    processor=processor,
    fps_sampling=2, # frames sampled per second: higher values give more precise segment borders, at the cost of speed.
    device='cuda', # use 'cpu' when running without a GPU
    frames_per_batch=100, # 100 frames use about 10GB of GPU RAM, so pick a batch size that fits your GPU.
    config=config,

    # Parameters for segment detection from probabilities. The default values should work well, but you can adjust them if they produce bad results for specific kinds of videos.
    min_prob=0.1, # minimum CLIP probability for a frame to count as a match
    max_gap_seconds=1, # longest gap with prob < min_prob that may still be inside a segment
    min_segment_seconds=3, # discard segments shorter than this
    smooth_fraction=0.02, # smoothing strength. Raw probabilities are smoothed to dampen fluctuations between frames.
)

  0%|          | 0/13 [00:00<?, ?it/s]

HFnSsLIB7a4 - starting


  8%|▊         | 1/13 [00:41<08:18, 41.50s/it]

probs (743,) frames 743
ubdIGnX2Hfs - starting


 15%|█▌        | 2/13 [01:25<07:53, 43.05s/it]

probs (825,) frames 825
p-R0HSfL6nw - starting


 23%|██▎       | 3/13 [02:39<09:30, 57.01s/it]

probs (1372,) frames 1372
byxWus7BwfQ - starting


 31%|███       | 4/13 [03:00<06:27, 43.08s/it]

probs (393,) frames 393
EbOPpWi4L8s - starting


 38%|███▊      | 5/13 [03:11<04:10, 31.26s/it]

probs (193,) frames 193
KJ8xAMJdZjQ - starting


 46%|████▌     | 6/13 [03:18<02:41, 23.13s/it]

probs (133,) frames 133
dCHLUtf--pg - starting


 54%|█████▍    | 7/13 [04:03<03:00, 30.15s/it]

probs (810,) frames 810
l83R5PblSMA - starting


 62%|██████▏   | 8/13 [04:05<01:45, 21.13s/it]

probs (33,) frames 33
xuf1czJv-XI - starting


 69%|██████▉   | 9/13 [04:14<01:09, 17.40s/it]

probs (170,) frames 170
LSj280OEKUI - starting


 77%|███████▋  | 10/13 [05:29<01:46, 35.40s/it]

probs (1410,) frames 1410
irfw1gQ0foQ - starting


 85%|████████▍ | 11/13 [06:41<01:32, 46.31s/it]

probs (1284,) frames 1284
3qkgrJNB6kA - starting
probs (8513,) frames 8513


 92%|█████████▏| 12/13 [14:43<02:58, 178.98s/it]

DGhHgiCfAb0 - starting


100%|██████████| 13/13 [16:05<00:00, 74.25s/it] 

probs (1521,) frames 1521





For each video we get a list of segments:
```
[
    ...
    {
        "start_timestamp": "00:00:32.500",
        "end_timestamp": "00:00:41.500",
        "fps": 29.97002997002997,
        "segment_info": null, # not used with clip, but could be used with gpt4o
        "video_id": "KvRK5Owqzgw"
    },
    ...
]
```
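
To make the roles of `min_prob`, `max_gap_seconds`, `min_segment_seconds` and `smooth_fraction` concrete, here is a minimal sketch of turning per-frame probabilities into segments. It is an illustrative reimplementation under stated assumptions (moving-average smoothing, window size proportional to the number of frames), not the SDK's actual algorithm; `smooth` and `probs_to_segments` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def smooth(probs: Sequence[float], window: int) -> List[float]:
    """Centered moving average; the window shrinks at the edges."""
    if window <= 1:
        return list(probs)
    half = window // 2
    out = []
    for i in range(len(probs)):
        lo, hi = max(0, i - half), min(len(probs), i + half + 1)
        out.append(sum(probs[lo:hi]) / (hi - lo))
    return out


def probs_to_segments(
    probs: Sequence[float],
    fps_sampling: float = 2.0,
    min_prob: float = 0.1,
    max_gap_seconds: float = 1.0,
    min_segment_seconds: float = 3.0,
    smooth_fraction: float = 0.02,
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs in seconds for runs of matching frames."""
    # Assumption: the smoothing window is a fraction of the frame count.
    window = max(1, round(len(probs) * smooth_fraction))
    smoothed = smooth(probs, window)
    max_gap_frames = int(max_gap_seconds * fps_sampling)

    runs: List[List[int]] = []  # [first_frame, last_frame] of each run
    for i, p in enumerate(smoothed):
        if p < min_prob:
            continue
        if runs and i - runs[-1][1] - 1 <= max_gap_frames:
            runs[-1][1] = i  # extend the run, bridging a short gap
        else:
            runs.append([i, i])

    segments = []
    for first, last in runs:
        start, end = first / fps_sampling, (last + 1) / fps_sampling
        if end - start >= min_segment_seconds:  # drop very short segments
            segments.append((start, end))
    return segments


# 40 frames sampled at 2 fps: a long match with a 1-second dip, then a 1-second blip.
probs = [0.0] * 4 + [0.9] * 10 + [0.0] * 2 + [0.9] * 6 + [0.0] * 10 + [0.9] * 2 + [0.0] * 6
print(probs_to_segments(probs, smooth_fraction=0.0))
# [(2.0, 11.0)]
```

The 1-second dip is bridged because it does not exceed `max_gap_seconds`, and the final 1-second blip is discarded because it is shorter than `min_segment_seconds`.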

## Annotation step 1: extract information from transcript

In [17]:
from datagen.clues import generate_clues_dataclass
SegmentAnnotationPlov = generate_clues_dataclass(prompt='making plov', config=config)

from pprint import pprint
pprint(SegmentAnnotationPlov.schema())

{'definitions': {'AdditionalInformation': {'description': 'Good logical '
                                                          'inference '
                                                          'examples:\n'
                                                          '[\n'
                                                          '  {\n'
                                                          '    "id": "LI1",\n'
                                                          '    "description": '
                                                          '"Primary '
                                                          'Demonstration of '
                                                          'Heel Lift",\n'
                                                          '    "details": '
                                                          '"Given that GC1-GC3 '
                                                          "describe the 'most "
                                   

In [2]:
from pprint import pprint
from datagen.clues import generate_clues_dataclass
SegmentAnnotationSquats = generate_clues_dataclass(prompt='improving squat exercise technique', config=config)
pprint(SegmentAnnotationSquats.schema())

{'definitions': {'AdditionalInformation': {'description': 'Good logical '
                                                          'inference '
                                                          'examples:\n'
                                                          '[\n'
                                                          '  {\n'
                                                          '    "id": "LI1",\n'
                                                          '    "description": '
                                                          '"Primary '
                                                          'Demonstration of '
                                                          'Heel Lift",\n'
                                                          '    "details": '
                                                          '"Given that GC1-GC3 '
                                                          "describe the 'most "
                                   

In [3]:
from datagen import generate_clues

clues = generate_clues(
    # video_ids=['4KmY44Xsg2w'],
    config=config,
    annotation_schema=SegmentAnnotationSquats,
    # human_prompt=human_prompt,
    segments_per_call=10,
    raise_on_error=True, # interrupt when encountering an error. Useful for debugging.
)

  0%|          | 0/21 [00:00<?, ?it/s]

byxWus7BwfQ - started
byxWus7BwfQ part 0 - started


  5%|▍         | 1/21 [00:09<03:12,  9.62s/it]

byxWus7BwfQ - done
3qkgrJNB6kA - started
3qkgrJNB6kA part 0 - started


 10%|▉         | 2/21 [00:32<05:33, 17.57s/it]

3qkgrJNB6kA - done
irfw1gQ0foQ - started
irfw1gQ0foQ part 0 - started


 14%|█▍        | 3/21 [00:47<04:49, 16.11s/it]

irfw1gQ0foQ - done
4KmY44Xsg2w - started
4KmY44Xsg2w part 0 - started


 19%|█▉        | 4/21 [01:01<04:23, 15.49s/it]

4KmY44Xsg2w - done
l83R5PblSMA - started
l83R5PblSMA part 0 - started


 24%|██▍       | 5/21 [01:03<02:46, 10.39s/it]

l83R5PblSMA - done
p-R0HSfL6nw - started
p-R0HSfL6nw part 0 - started


 29%|██▊       | 6/21 [01:05<01:57,  7.82s/it]

p-R0HSfL6nw - done
xqvCmoLULNY - started
xqvCmoLULNY part 0 - started


 33%|███▎      | 7/21 [01:16<02:04,  8.87s/it]

xqvCmoLULNY - done
dCHLUtf--pg - started
dCHLUtf--pg part 0 - started


 38%|███▊      | 8/21 [01:28<02:08,  9.89s/it]

dCHLUtf--pg - done
LSj280OEKUI - started
LSj280OEKUI part 0 - started


 43%|████▎     | 9/21 [01:50<02:40, 13.38s/it]

LSj280OEKUI - done
KvRK5Owqzgw - started
KvRK5Owqzgw part 0 - started


 48%|████▊     | 10/21 [02:21<03:28, 18.99s/it]

KvRK5Owqzgw - done
EbOPpWi4L8s - started
EbOPpWi4L8s part 0 - started


 52%|█████▏    | 11/21 [02:29<02:35, 15.60s/it]

EbOPpWi4L8s - done
DGhHgiCfAb0 - started
DGhHgiCfAb0 part 0 - started


 57%|█████▋    | 12/21 [02:32<01:46, 11.83s/it]

DGhHgiCfAb0 - done
xuf1czJv-XI - started
xuf1czJv-XI part 0 - started


 62%|██████▏   | 13/21 [02:46<01:38, 12.37s/it]

xuf1czJv-XI - done
ubdIGnX2Hfs - started
ubdIGnX2Hfs part 0 - started


 67%|██████▋   | 14/21 [03:00<01:30, 12.90s/it]

ubdIGnX2Hfs - done
gcNh17Ckjgg - started
gcNh17Ckjgg part 0 - started
gcNh17Ckjgg part 1 - started


 71%|███████▏  | 15/21 [04:14<03:08, 31.37s/it]

gcNh17Ckjgg - done
KJ8xAMJdZjQ - started
KJ8xAMJdZjQ part 0 - started


 76%|███████▌  | 16/21 [04:40<02:28, 29.74s/it]

KJ8xAMJdZjQ - done
PPmvh7gBTi0 - started
PPmvh7gBTi0 part 0 - started


 81%|████████  | 17/21 [05:18<02:08, 32.06s/it]

PPmvh7gBTi0 - done
IB_icWRzi4E - started
IB_icWRzi4E part 0 - started


 86%|████████▌ | 18/21 [05:26<01:15, 25.11s/it]

IB_icWRzi4E - done
_uZLFUnKSaM - started
_uZLFUnKSaM part 0 - started


 90%|█████████ | 19/21 [05:38<00:42, 21.13s/it]

_uZLFUnKSaM - done
HFnSsLIB7a4 - started
HFnSsLIB7a4 part 0 - started


 95%|█████████▌| 20/21 [05:58<00:20, 20.74s/it]

HFnSsLIB7a4 - done
YaXPRqUwItQ - started
YaXPRqUwItQ part 0 - started


100%|██████████| 21/21 [06:12<00:00, 17.74s/it]

YaXPRqUwItQ - done
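
`segments_per_call` caps how many segments are sent to the model in a single call. The `part 0`, `part 1` lines in the log above look like such batches, although the notebook doesn't state this, so treat it as an assumption. A minimal sketch of the batching, with `split_into_parts` as a hypothetical helper:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_parts(segments: Sequence[T], segments_per_call: int) -> List[List[T]]:
    """Split segments into consecutive batches of at most segments_per_call items."""
    if segments_per_call < 1:
        raise ValueError("segments_per_call must be at least 1")
    return [
        list(segments[i:i + segments_per_call])
        for i in range(0, len(segments), segments_per_call)
    ]


# A video with 13 segments and segments_per_call=10 needs two calls.
parts = split_into_parts(list(range(13)), segments_per_call=10)
print([len(part) for part in parts])
# [10, 3]
```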





## Annotation step 2: generate annotations from transcript and clues

In [4]:
from datagen import generate_annotations
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This is the information that will be extracted for each segment from the transcript and the data from the previous step.
# This is the most important part of the annotation, and getting good results requires a lot of experimenting.

import inspect

human_prompt = '''
You are given a JSON object that contains clues about segments of a video with timecodes.
!!!! For each segment provided in the JSON object, you need to answer the following questions:
1. Given the data found in the JSON object, what is the probability that this part contains footage of a person doing squats? [the answer can only be "High", "Medium" or "Low"]
2. Given the data found in the JSON object, and even if the answer to the previous question is "Low", does this person do squats right, wrong, or mixed? [the answer can only be "Right", "Wrong" or "Mixed"]
3. Given the data found in the JSON object, what exactly does this person do right and/or wrong regarding their squat technique? [the answer should be clear and focused on body parts]
4. If the answer to the previous question contains a description of wrong technique, explain how to fix these mistakes using your "own knowledge", as if you were a sports coach.
'''

class SegmentFeedback(BaseModel):
    '''
    You are a fitness trainer giving feedback on what was right, wrong, and what could be improved.
    Talk as you would talk to a trainee, but avoid excessive language or irrelevant banter.

—> GOOD EXAMPLES:
    "wrong":"Knees caving in: This can stress the knees and reduce effectiveness"
    "correction":"Focus on keeping knees aligned with your toes."
    "wrong":"Rounding the back: This increases the risk of back injuries"
    "correction":"Keep your chest up and maintain a neutral spine throughout the movement."
    "wrong":"Heels are lifting off the ground: this shifts the weight forward, reducing stability"
    "correction":" Keep your weight on your heels and press through them as you rise."
    "right":"Chest and shoulders: The chest is up, and the shoulders are back, maintaining an upright torso."
    "correction":null
—> BAD EXAMPLES:
    "wrong":"knees"
    "correction":"fix knees"
    "wrong":"back looks funny"
    "correction":"make back better"
    "wrong":"feet are doing something"
    "correction":"feet should be different"
    "right":"arms"
    "correction":"arms are fine i think"
—> BAD EXAMPLES END HERE
    '''
    right: Optional[str] = Field(description='what was right in the performance')
    wrong: Optional[str] = Field(description='what was wrong in the performance')
    correction: Optional[str] = Field(description='how and in what ways the performance could be improved')

# The segment timestamps are taken from the provided information.
class SegmentAnnotation(BaseModel):
    '''
    This annotation is generated exclusively from the provided information about this specific segment.
    Don't pay attention to information about other segments.
    '''
    # squats_probability: Optional[str] = Field(description='how high is the probability that the person is doing squats in the segment: low, medium, high, unknown(null)')
    squats_technique_correctness: Optional[bool] = Field(description='boolean correctness of the squat technique.')
    squats_feedback: SegmentFeedback = Field(description='what was right and wrong in the squat performance in the segment. When the technique is incorrect, provide instructions on how to correct it.')

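# To keep only the segments where a boolean schema field (e.g. "doing_squats") is positive, pass `filter_by`.
# It is commented out here because the schema above has no such field.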
annotations = generate_annotations(
    human_prompt=human_prompt,
    config=config,
    annotation_schema=SegmentAnnotation,
    # filter_by='doing_squats'
)

  0%|          | 0/1 [00:00<?, ?it/s]

KvRK5Owqzgw - started


100%|██████████| 1/1 [00:03<00:00,  3.83s/it]

KvRK5Owqzgw - done





Now we get a list of annotations for each video:
```
[
    {
        "start_timestamp": "00:00:51.760",
        "end_timestamp": "00:01:01.520",
        "segment_annotation": {
            "correct": null,
            "incorrect_reasons": null,
            "qa": [
                {
                    "question": "Was there important advice about performing the exercise correctly?",
                    "answer": "Yes, the advice was to make sure the knees do not go forward of the toes.",
                    "quote": "making sure that your knees do not go forward of your toes"
                }
            ]
        }
    }
]
```

In [2]:
from datagen import aggregate_annotations

# saved to annotations.json
annotations = aggregate_annotations(config)
print('Total segments:', len(annotations))
annotations[0]

skipping gcNh17Ckjgg
Total segments: 22


{'start_timestamp': '00:00:20.479',
 'end_timestamp': '00:00:26.485',
 'segment_annotation': {'correct': None,
  'incorrect_reasons': None,
  'qa': [{'question': 'Was the exercise (squat) performed correctly?',
    'answer': 'Yes, the squat exercise was described correctly.',
    'quote': "let's learn how to properly perform a squat...cross your arms in front...shift your weight to the ball of your feet...bend your knees...push back up to the starting position."}]},
 'video_id': 'xqvCmoLULNY',
 'id': 'xqvCmoLULNY_0',
 'video_path': 'xqvCmoLULNY_0.mp4'}
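
Judging by the output above, each aggregated item gets an `id` of the form `<video_id>_<index>` and a `video_path` of `<id>.mp4`. A minimal sketch of that flattening, assuming the per-video annotation lists are held in a dict; `flatten_annotations` is hypothetical and is not the SDK's `aggregate_annotations`:

```python
from typing import Dict, List


def flatten_annotations(per_video: Dict[str, List[dict]]) -> List[dict]:
    """Merge per-video segment lists into one list with ids and clip paths."""
    flat = []
    for video_id, segments in per_video.items():
        for index, segment in enumerate(segments):
            item_id = f"{video_id}_{index}"
            flat.append({
                **segment,
                "video_id": video_id,
                "id": item_id,
                "video_path": f"{item_id}.mp4",
            })
    return flat


per_video = {
    "xqvCmoLULNY": [
        {"start_timestamp": "00:00:20.479", "end_timestamp": "00:00:26.485"},
    ],
}
items = flatten_annotations(per_video)
print(items[0]["id"], items[0]["video_path"])
# xqvCmoLULNY_0 xqvCmoLULNY_0.mp4
```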

## The last step is to cut video clips for annotated segments from original videos

In [3]:
from datagen import cut_videos
cut_videos(config=config)

100%|██████████| 22/22 [00:14<00:00,  1.55it/s]
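
Cutting a clip amounts to extracting the span between `start_timestamp` and `end_timestamp` from the source file. One common way to do this is ffmpeg. The sketch below only builds the command line and is an assumption about the approach, not a description of what `cut_videos` does internally; `build_cut_command` and the example paths are illustrative.

```python
from typing import List


def build_cut_command(src: str, dst: str, start: str, end: str) -> List[str]:
    """Build an ffmpeg command that re-encodes the [start, end] span of src into dst."""
    return ["ffmpeg", "-y", "-i", src, "-ss", start, "-to", end, dst]


cmd = build_cut_command(
    "videos/xqvCmoLULNY.mp4",
    "clips/xqvCmoLULNY_0.mp4",
    "00:00:20.479",
    "00:00:26.485",
)
print(" ".join(cmd))
# ffmpeg -y -i videos/xqvCmoLULNY.mp4 -ss 00:00:20.479 -to 00:00:26.485 clips/xqvCmoLULNY_0.mp4

# To actually run it (requires ffmpeg on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Placing `-ss`/`-to` after `-i` makes ffmpeg decode and re-encode, which is slower than stream copying but cuts at the requested timestamps rather than at the nearest keyframe.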


As a result, we generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items, each with the fields:
    - `video_id`: 11-character YouTube video ID (youtube.com/watch?v=<id>)
    - `start_timestamp`/`end_timestamp` of the clip, relative to the YouTube video it's taken from
    - `video_path` of the clip, relative to `<data_dir>/clips/`
    - `segment_annotation` that you can use for training
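
To consume the result, a training script might load `annotations.json` and derive per-clip metadata such as duration. A minimal sketch, using an in-memory sample shaped like the item shown earlier instead of reading the file; `timestamp_to_seconds` and `clip_duration` are hypothetical helpers:

```python
def timestamp_to_seconds(ts: str) -> float:
    """Convert 'HH:MM:SS.mmm' into seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clip_duration(item: dict) -> float:
    """Length of the clip in seconds, from its start and end timestamps."""
    return timestamp_to_seconds(item["end_timestamp"]) - timestamp_to_seconds(item["start_timestamp"])


# In practice the items would come from the generated file, e.g.:
# import json
# with open("<data_dir>/annotations.json") as f:
#     annotations = json.load(f)
item = {
    "start_timestamp": "00:00:20.479",
    "end_timestamp": "00:00:26.485",
    "video_id": "xqvCmoLULNY",
    "video_path": "xqvCmoLULNY_0.mp4",
}
print(round(clip_duration(item), 3))
# 6.006
```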