# Getting Started with Data Generation SDK

We are going to generate a dataset of squat videos with instructions on how to perform them, so that we can train an AI personal trainer.

In [45]:
%load_ext autoreload
%autoreload 2

from datagen import DatagenConfig

config_params = {
    'openai': {
        'type': 'azure', # openai/azure
        'temperature': '1',
        'deployment': 'gpt4o' # model for openai / deployment for azure
    },
    'data_dir': './tmp/squats'
}

# This config handles all the bookkeeping, so you need to pass it to every SDK call.
config = DatagenConfig(**config_params)

## Get a list of search queries to search for videos

In [4]:
from datagen import get_queries
queries = get_queries(
    config=config,
    prompt='I want to find instructional videos about how to do squats.',
    num_queries=2
)
queries

['how to do squats',
 'squat exercise tutorial',
 'beginner guide to squats',
 'proper squat form',
 'squat workout video']

## Download video information for each query.

We'll get 2 videos for each query.<br>
The same video can be returned by several queries, so we might end up with fewer than `n_queries*videos_per_query` videos.<br>
If you want to get all YouTube videos for a query, don't pass the `videos_per_query` parameter.

You can limit the search to videos licensed under Creative Commons (as indicated by YouTube).<br>
The search libraries don't support this filter directly yet, so we search for all videos and filter by license afterwards.<br>
Unfortunately, this usually leaves very few results, so use it with caution.
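
The deduplication and post-search license filtering described above can be sketched in plain Python. This is only an illustration, not the SDK's implementation: `merge_search_results`, the per-video dictionaries and their `license` key are hypothetical names chosen for this example.

```python
from typing import Dict, List


def merge_search_results(
    results_per_query: Dict[str, List[dict]],
    only_creative_commons: bool = False,
) -> List[str]:
    """Merge per-query search results into one list of unique video ids.

    Hypothetical helper: each result is assumed to look like
    {'id': '<video id>', 'license': '<license name>'}.
    """
    seen = set()
    merged = []
    for results in results_per_query.values():
        for video in results:
            license_name = (video.get('license') or '').lower()
            # The license filter is applied after the search, as described above.
            if only_creative_commons and 'creative commons' not in license_name:
                continue
            # A video found by several queries is kept only once.
            if video['id'] not in seen:
                seen.add(video['id'])
                merged.append(video['id'])
    return merged


# Made-up search results: video 'bbb' is returned by both queries.
example = {
    'how to do squats': [
        {'id': 'aaa', 'license': 'Standard YouTube License'},
        {'id': 'bbb', 'license': 'Creative Commons Attribution license'},
    ],
    'proper squat form': [
        {'id': 'bbb', 'license': 'Creative Commons Attribution license'},
        {'id': 'ccc', 'license': 'Standard YouTube License'},
    ],
}

print(merge_search_results(example))                              # 3 unique ids out of 4 results
print(merge_search_results(example, only_creative_commons=True))  # only the CC-licensed video
```

With 2 queries and 2 videos per query, the merged list has 3 ids instead of 4, and the license filter reduces it further, which is why the Creative Commons option tends to return so few videos.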

In [5]:
from datagen import get_video_ids
ids = get_video_ids(queries, config=config, videos_per_query=2, only_creative_commons=False)
ids

  0%|          | 0/5 [00:00<?, ?it/s]

100%|██████████| 5/5 [00:03<00:00,  1.26it/s]


['KJ8xAMJdZjQ',
 'ubdIGnX2Hfs',
 'YaXPRqUwItQ',
 'l83R5PblSMA',
 'irfw1gQ0foQ',
 'dCHLUtf--pg',
 'PPmvh7gBTi0',
 'EbOPpWi4L8s',
 '4KmY44Xsg2w',
 '3qkgrJNB6kA',
 'IB_icWRzi4E',
 'HFnSsLIB7a4',
 'xqvCmoLULNY',
 'LSj280OEKUI',
 'gcNh17Ckjgg',
 'p-R0HSfL6nw',
 'DGhHgiCfAb0',
 '_uZLFUnKSaM',
 'byxWus7BwfQ',
 'xuf1czJv-XI']

## Download videos and autogenerated subtitles

You can change subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (refer to https://github.com/yt-dlp/yt-dlp).<br>
The SDK expects `.mp4` video files (for now), so don't change the video format.
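
As a rough illustration, the dictionary below uses option names from yt-dlp's `YoutubeDL` Python API to request English and Spanish subtitles. How `yt_dlp_opts` is passed to `download_videos` is not shown in this notebook, so the commented-out call is an assumption about the signature.

```python
# Option names below are standard yt-dlp YoutubeDL options.
yt_dlp_opts = {
    'writesubtitles': True,          # download uploader-provided subtitles when available
    'writeautomaticsub': True,       # also download auto-generated subtitles
    'subtitleslangs': ['en', 'es'],  # subtitle languages to fetch
    'subtitlesformat': 'vtt',        # the download log in this notebook shows .vtt subtitle files
}

# Assumed call signature; check the SDK before relying on it:
# download_videos(ids, config, yt_dlp_opts=yt_dlp_opts)
print(yt_dlp_opts)
```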

In [9]:
from datagen import download_videos
download_videos(ids, config)

[youtube] Extracting URL: https://www.youtube.com/watch?v=ubdIGnX2Hfs
[youtube] ubdIGnX2Hfs: Downloading webpage




[youtube] ubdIGnX2Hfs: Downloading ios player API JSON
[youtube] ubdIGnX2Hfs: Downloading player d2e656ee
[youtube] ubdIGnX2Hfs: Downloading web player API JSON


         n = 4fJ2aQguR3YtedGA ; player = https://www.youtube.com/s/player/d2e656ee/player_ias.vflset/en_US/base.js
         n = 6nvrBUMEriQ9VtTj ; player = https://www.youtube.com/s/player/d2e656ee/player_ias.vflset/en_US/base.js


[youtube] ubdIGnX2Hfs: Downloading m3u8 information
[info] ubdIGnX2Hfs: Downloading subtitles: en
[info] ubdIGnX2Hfs: Downloading 1 format(s): 617
[info] Writing video subtitles to: tmp/squats3/videos/ubdIGnX2Hfs.en.vtt
[download] Destination: tmp/squats3/videos/ubdIGnX2Hfs.en.vtt
[download] 100% of   41.03KiB in 00:00:00 at 84.70KiB/s
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 74
[download] Destination: tmp/squats3/videos/ubdIGnX2Hfs.mp4
[download] 100% of  126.46MiB in 00:00:42 at 3.00MiB/s                  
[MoveFiles] Moving file "tmp/squats3/videos/ubdIGnX2Hfs.en.vtt" to "tmp/squats3/subs/ubdIGnX2Hfs.en.vtt"


## Detect segments from video

We will use the CLIP-based detector (`detect_segments_clip`, run here with a SigLIP model) because it's much faster than gpt4o, but it needs a GPU.
You can also run it on a CPU for debugging.

In [2]:
from transformers import AutoProcessor, AutoModel

# Remove .cuda() to run on CPU.
# SigLIP outputs an independent probability for each text prompt, whereas CLIP outputs multiclass (softmax) probabilities.
model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384").cuda()
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
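
The difference between independent and multiclass probabilities mentioned in the comment above can be shown with plain Python. This is a minimal sketch with made-up logits, not output from the model. It matters here because a softmax over a single prompt would always return 1.0, so a threshold such as `min_prob` only makes sense with independent (sigmoid) scores.

```python
import math


def sigmoid_probs(logits):
    """SigLIP-style scoring: every prompt gets its own independent probability."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def softmax_probs(logits):
    """CLIP-style scoring: probabilities compete and always sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up image-text logits for two prompts that both match the frame poorly.
logits = [-3.0, -4.0]
print(sigmoid_probs(logits))    # both probabilities stay low
print(softmax_probs(logits))    # the "least bad" prompt still gets most of the mass
print(softmax_probs([-3.0]))    # a single prompt always gets probability 1.0
```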

In [9]:
from datagen import detect_segments_clip

from typing import Optional
from langchain.pydantic_v1 import BaseModel, Field

detect_segments_clip(
    # video_ids=['KvRK5Owqzgw'],
    text_prompts='a person doing squats', # the text that CLIP compares to the frames. You can provide a list of texts to use the average distance.
    model=model,
    processor=processor,
    fps_sampling=2, # a higher fps gives more granular and more precise segment borders, at the cost of speed.
    device='cuda', # use 'cpu' on a local machine without a GPU
    frames_per_batch=100, # 100 frames use about 10GB of GPU RAM, so pick a batch size that fills your GPU RAM.
    config=config,

    # Parameters for detecting segments from probabilities. The default values should work well, but you can adjust them if they produce bad results for specific kinds of videos.
    min_prob=0.1, # minimum CLIP probability to count a frame as a match
    max_gap_seconds=1, # longest gap with prob < min_prob that is still allowed inside a segment
    min_segment_seconds=3, # discard segments shorter than this
    smooth_fraction=0.02, # smoothing strength. Raw probabilities are smoothed to reduce fluctuations between frames.
)

  0%|          | 0/13 [00:00<?, ?it/s]

HFnSsLIB7a4 - starting


  8%|▊         | 1/13 [00:41<08:18, 41.50s/it]

probs (743,) frames 743
ubdIGnX2Hfs - starting


 15%|█▌        | 2/13 [01:25<07:53, 43.05s/it]

probs (825,) frames 825
p-R0HSfL6nw - starting


 23%|██▎       | 3/13 [02:39<09:30, 57.01s/it]

probs (1372,) frames 1372
byxWus7BwfQ - starting


 31%|███       | 4/13 [03:00<06:27, 43.08s/it]

probs (393,) frames 393
EbOPpWi4L8s - starting


 38%|███▊      | 5/13 [03:11<04:10, 31.26s/it]

probs (193,) frames 193
KJ8xAMJdZjQ - starting


 46%|████▌     | 6/13 [03:18<02:41, 23.13s/it]

probs (133,) frames 133
dCHLUtf--pg - starting


 54%|█████▍    | 7/13 [04:03<03:00, 30.15s/it]

probs (810,) frames 810
l83R5PblSMA - starting


 62%|██████▏   | 8/13 [04:05<01:45, 21.13s/it]

probs (33,) frames 33
xuf1czJv-XI - starting


 69%|██████▉   | 9/13 [04:14<01:09, 17.40s/it]

probs (170,) frames 170
LSj280OEKUI - starting


 77%|███████▋  | 10/13 [05:29<01:46, 35.40s/it]

probs (1410,) frames 1410
irfw1gQ0foQ - starting


 85%|████████▍ | 11/13 [06:41<01:32, 46.31s/it]

probs (1284,) frames 1284
3qkgrJNB6kA - starting
probs (8513,) frames 8513


 92%|█████████▏| 12/13 [14:43<02:58, 178.98s/it]

DGhHgiCfAb0 - starting


100%|██████████| 13/13 [16:05<00:00, 74.25s/it] 

probs (1521,) frames 1521





For each video we get a list of segments:
```
[
    ...
    {
        "start_timestamp": "00:00:32.500",
        "end_timestamp": "00:00:41.500",
        "fps": 29.97002997002997,
        "segment_info": null, # not used with clip, but could be used with gpt4o
        "video_id": "KvRK5Owqzgw"
    },
    ...
]
```
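
To make the role of `min_prob`, `max_gap_seconds`, `min_segment_seconds` and `smooth_fraction` more concrete, here is a minimal sketch of how per-frame probabilities could be turned into segments with timestamps in the format shown above. It is not the SDK's code: `probs_to_segments` is a hypothetical helper, and treating `smooth_fraction` as a moving-average window equal to that fraction of the sampled frames is an assumption.

```python
from typing import List, Tuple


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, the timestamp style shown above."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def smooth(probs: List[float], window: int) -> List[float]:
    """Centered moving average; a window of 1 or less leaves the values unchanged."""
    if window <= 1:
        return list(probs)
    half = window // 2
    return [
        sum(probs[max(0, i - half): i + half + 1]) / len(probs[max(0, i - half): i + half + 1])
        for i in range(len(probs))
    ]


def probs_to_segments(probs, fps_sampling=2, min_prob=0.1, max_gap_seconds=1,
                      min_segment_seconds=3, smooth_fraction=0.02) -> List[Tuple[str, str]]:
    """Hypothetical sketch: threshold, bridge short gaps, drop short segments."""
    smoothed = smooth(probs, int(len(probs) * smooth_fraction))  # assumed meaning of smooth_fraction
    max_gap_frames = int(max_gap_seconds * fps_sampling)
    raw_segments = []
    start = last_hit = None
    for i, p in enumerate(smoothed):
        if p >= min_prob:
            if start is None:
                start = i            # first matching frame opens a segment
            last_hit = i
        elif start is not None and i - last_hit > max_gap_frames:
            raw_segments.append((start, last_hit))  # the gap is too long: close the segment
            start = None
    if start is not None:
        raw_segments.append((start, last_hit))
    return [
        (format_timestamp(s / fps_sampling), format_timestamp(e / fps_sampling))
        for s, e in raw_segments
        if (e - s) / fps_sampling >= min_segment_seconds  # discard very short segments
    ]


# Made-up probabilities sampled at 2 fps: a long match with a 1-second dip, then a short blip.
probs = [0.0] * 4 + [0.5] * 6 + [0.0] * 2 + [0.5] * 4 + [0.0] * 6 + [0.5] * 2 + [0.0] * 4
print(probs_to_segments(probs))
```

In this toy input the 1-second dip is bridged because it does not exceed `max_gap_seconds`, while the final half-second blip is dropped by `min_segment_seconds`.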

## Annotation step 1: extract information (clues) from transcript

In [7]:
from datagen import generate_clues

human_prompt = """
The provided video is a tutorial about how to perform squats. 

I need to understand HOW THE PERSON SHOWN IN EACH SEGMENT PERFORMS SQUATS IN THIS SEGMENT.
What is done correctly.
What mistakes they make. Why these mistakes happen.
How these mistakes could be improved.

It is very important that the information you provide describes how the person shown in the segment is doing squats, and not some generic advice that is unrelated to the visual information.
"""

clues = generate_clues(
    # video_ids=['byxWus7BwfQ'],
    config=config,
    human_prompt=human_prompt,
    segments_per_call=5, # the output might be quite long, so we limit the number of segments per gpt call to respect the max output length
    raise_on_error=True, # interrupt when encountering an error. Useful for debugging.
)

  0%|          | 0/1 [00:00<?, ?it/s]

byxWus7BwfQ - started
byxWus7BwfQ part 0 - started


100%|██████████| 1/1 [00:25<00:00, 25.34s/it]

byxWus7BwfQ - done
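
The `segments_per_call` parameter splits a video's segments into several model calls (the log above shows `part 0`). A minimal sketch of such batching follows; `batch_segments` and the placeholder segment dictionaries are hypothetical and only illustrate the idea, not the SDK's internals.

```python
from typing import Iterator, List, Sequence


def batch_segments(segments: Sequence[dict], segments_per_call: int) -> Iterator[List[dict]]:
    """Yield consecutive groups of at most `segments_per_call` segments."""
    if segments_per_call < 1:
        raise ValueError('segments_per_call must be at least 1')
    for start in range(0, len(segments), segments_per_call):
        yield list(segments[start:start + segments_per_call])


# 12 placeholder segments are split into parts of 5, 5 and 2 segments.
segments = [{'segment_index': i} for i in range(12)]
parts = list(batch_segments(segments, segments_per_call=5))
print([len(part) for part in parts])
```

A smaller `segments_per_call` means more calls per video but shorter outputs per call, which helps keep each response within the model's output limit.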





In [1]:
from time import sleep
from datagen import generate_clues

# Repeat clue generation every 60 seconds until the kernel is interrupted.
# `config` and `human_prompt` must be defined as in the cells above.
while True:
    clues = generate_clues(
        # video_ids=['byxWus7BwfQ'],
        config=config,
        human_prompt=human_prompt,
        segments_per_call=5, # the output might be quite long, so we limit the number of segments per gpt call to respect the max output length
        raise_on_error=True, # interrupt when encountering an error. Useful for debugging.
    )
    sleep(60)

## Annotation step 2: generate annotations from the transcript and clues

In [26]:
from datagen import generate_annotations
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This is the information that will be extracted for each segment from the transcript and the data from the previous step.
# It is the most important part of the annotation, and getting good results requires a lot of experimenting.


human_prompt = '''
You are given a JSON object that contains clues about segments of a video with timecodes.
!!!! For each segment provided in the JSON object you need to answer the following questions:
1. Given the data found in the JSON object, what is the probability that this part contains footage of a person doing squats? [the answer can only be "high", "medium", "low", or null (if impossible to infer from the provided data)]
2. Given the data found in the JSON object, and even if the answer to the previous question is "low", does this person do squats right, wrong, or mixed? [the answer can only be "right", "wrong", "mixed", or null (if impossible to infer from the provided data)]
3. Given the data found in the JSON object, what exactly does this person do right and/or wrong regarding their squat technique? [the answer should be clear and focused on body parts]
4. If the answer to the previous question contains a description of wrong technique, explain how to fix these mistakes using your "own knowledge" like you are a sports coach.
'''

class SegmentFeedback(BaseModel):
    '''
—> GOOD EXAMPLES:
    "wrong":"Knees caving in: This can stress the knees and reduce effectiveness"
    "correction":"Focus on keeping knees aligned with your toes."
    "wrong":"Rounding the back: This increases the risk of back injuries"
    "correction":"Keep your chest up and maintain a neutral spine throughout the movement."
    "wrong":"Heels are lifting off the ground: this shifts the weight forward, reducing stability"
    "correction":" Keep your weight on your heels and press through them as you rise."
    "right":"Chest and shoulders: The chest is up, and the shoulders are back, maintaining an upright torso."
    "correction":null
—> BAD EXAMPLES:
    "wrong":"knees"
    "correction":"fix knees"
    "wrong":"back looks funny"
    "correction":"make back better"
    "wrong":"feet are doing something"
    "correction":"feet should be different"
    "right":"arms"
    "correction":"arms are fine i think"
—> BAD EXAMPLES END HERE
    '''
    right: Optional[str] = Field(description='what was right in the performance')
    wrong: Optional[str] = Field(description='what was wrong in the performance')
    correction: Optional[str] = Field(description='how and in what ways the performance could be improved')

# The segment timestamps are taken from the provided information.
class SegmentAnnotation(BaseModel):
    '''
Here is a JSON object that contains data about parts with timecodes of a video file where a person does squats.
!!!! Answer the following questions:
1. Given the data found in the JSON object, what is the probability that this part contains footage of a person doing squats? [the answer can only be "high", "medium" or "low"]
2. Given the data found in the JSON object, and even if the answer to the previous question is "low", does this person do squats correctly or not?
3. Given the data found in the JSON object, what exactly does this person do right and/or wrong regarding their squat technique? [the answer should be clear and focused on body parts]
4. If the answer to the previous question contains a description of wrong technique, explain how to fix these mistakes using your "own knowledge" like you are a sports coach.
    '''
    squats_probability: Optional[str] = Field(description='how high is the probability that the person is doing squats in the segment: low, medium, high, unknown(null)')
    squats_technique_correctness: Optional[str] = Field(description='correctness of the squat technique.')
    squats_feedback: Optional[SegmentFeedback] = Field(description='what was right and wrong in the squat performance in the segment. When the technique is incorrect, provide instructions on how to correct it.')

# Segments with a low or unknown `squats_probability` are filtered out later, in `filter_annotations`.
annotations = generate_annotations(
    human_prompt=human_prompt,
    config=config,
    segments_per_call=5,
    annotation_schema=SegmentAnnotation,
)

  0%|          | 0/1 [00:00<?, ?it/s]

KJ8xAMJdZjQ - started


100%|██████████| 1/1 [00:07<00:00,  7.17s/it]

KJ8xAMJdZjQ - done





Now we get a list of annotations for each video. The fields inside `segment_annotation` follow the `annotation_schema` you pass; the example below was produced with a different schema than `SegmentAnnotation` above:
```
[
    {
        "start_timestamp": "00:00:51.760",
        "end_timestamp": "00:01:01.520",
        "segment_annotation": {
            "correct": null,
            "incorrect_reasons": null,
            "qa": [
                {
                    "question": "Was there important advice about performing the exercise correctly?",
                    "answer": "Yes, the advice was to make sure the knees do not go forward of the toes.",
                    "quote": "making sure that your knees do not go forward of your toes"
                }
            ]
        }
    },
    ...
]
```

In [47]:
from datagen import aggregate_annotations

# The aggregated annotations are saved to annotations.json.

def filter_annotations(ann):
    if ann['squats_probability'] in [None, 'low', 'None', 'null']:
        # Skip segments where the probability is low or could not be inferred.
        return False
    if ann['squats_technique_correctness'] in [None, 'null', 'None']:
        # If we couldn't establish correctness at all, the feedback is probably useless.
        return False
    if ann['squats_technique_correctness'] in ['mixed']:
        # Discard "mixed" segments with empty feedback, since there is nothing to use for training.
        if ann['squats_feedback'] is None:
            return False
        if set(ann['squats_feedback'].values()) == {None}:
            return False
    return True

annotations = aggregate_annotations(config, filter_func=filter_annotations, annotation_file='annotations.json')
print('Total segments:', len(annotations))
annotations[0]

100%|██████████| 20/20 [00:00<00:00, 88208.29it/s]

Total segments: 31





{'start_timestamp': '00:01:20.250',
 'end_timestamp': '00:01:25.250',
 'segment_annotation': {'squats_probability': 'medium',
  'squats_technique_correctness': 'mixed',
  'squats_feedback': {'right': 'Correct knee alignment: Not letting the knees go past a specific line.',
   'wrong': 'Potential common mistakes as seen in weight rooms and competitions.',
   'correction': 'Focus on proper knee alignment to prevent potential damage: Ensure your knees do not pass a plumb line drawn from your toes during the squat.'}},
 'video_id': 'byxWus7BwfQ',
 'id': 'byxWus7BwfQ_0',
 'video_path': 'byxWus7BwfQ_0.mp4'}

## Last step: cut video clips for the annotated segments from the original videos

In [3]:
from datagen import cut_videos
cut_videos(config=config)

100%|██████████| 22/22 [00:14<00:00,  1.55it/s]
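
The notebook does not show how `cut_videos` extracts the clips. As a rough illustration only, the sketch below builds an ffmpeg command for one segment, assuming ffmpeg is available; `build_cut_command` is a hypothetical helper, and the paths follow the `videos/` and `clips/` folder names and the `<video_id>_<index>.mp4` clip naming seen elsewhere in this notebook.

```python
from typing import List


def build_cut_command(video_path: str, start: str, end: str, clip_path: str) -> List[str]:
    """Build (but do not run) an ffmpeg command that cuts [start, end] out of a video."""
    return [
        'ffmpeg',
        '-y',               # overwrite the clip if it already exists
        '-i', video_path,   # source video
        '-ss', start,       # clip start, HH:MM:SS.mmm
        '-to', end,         # clip end, HH:MM:SS.mmm
        clip_path,          # output clip, re-encoded with ffmpeg defaults
    ]


cmd = build_cut_command(
    'videos/byxWus7BwfQ.mp4', '00:01:20.250', '00:01:25.250', 'clips/byxWus7BwfQ_0.mp4'
)
print(' '.join(cmd))

# To actually cut the clip (requires ffmpeg on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```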


As a result, we have generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items with the following fields:
    - `video_id`: 11-character YouTube video id (youtube.com/watch?v=<id>)
    - `start_timestamp`/`end_timestamp`: start and end of the clip relative to the YouTube video it's taken from
    - `video_path`: path of the clip relative to `<data_dir>/clips/`
    - `segment_annotation`: the annotation that you can use for training
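
A minimal sketch of how these outputs could be consumed is shown below. It pairs each clip path with a text built from `squats_feedback`. `load_training_pairs` is a hypothetical helper, the way the feedback fields are joined is just one possible choice, and the demo record is a made-up placeholder shaped like the aggregated annotation shown earlier.

```python
import json
import tempfile
from pathlib import Path


def load_training_pairs(data_dir: Path):
    """Return (clip_path, feedback_text) pairs from <data_dir>/annotations.json."""
    items = json.loads((data_dir / 'annotations.json').read_text())
    pairs = []
    for item in items:
        feedback = item['segment_annotation'].get('squats_feedback') or {}
        # Join the non-empty feedback fields into one target text.
        text = ' '.join(
            f'{key}: {feedback[key]}'
            for key in ('right', 'wrong', 'correction')
            if feedback.get(key)
        )
        pairs.append((data_dir / 'clips' / item['video_path'], text))
    return pairs


# Demo in a temporary directory with one made-up record.
record = {
    'start_timestamp': '00:01:20.250',
    'end_timestamp': '00:01:25.250',
    'segment_annotation': {
        'squats_probability': 'medium',
        'squats_technique_correctness': 'mixed',
        'squats_feedback': {
            'right': 'Knees track over the toes.',
            'wrong': None,
            'correction': 'Keep the chest up.',
        },
    },
    'video_id': 'byxWus7BwfQ',
    'video_path': 'byxWus7BwfQ_0.mp4',
}

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / 'annotations.json').write_text(json.dumps([record]))
    pairs = load_training_pairs(data_dir)

clip_path, text = pairs[0]
print(clip_path.name, '->', text)
```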