# Getting Started with Data Generation SDK

We are going to generate a dataset of squat videos with instructions on how to perform them, so that we can train an AI personal trainer.

In [1]:
%load_ext autoreload
%autoreload 2

from datagen import DatagenConfig

config_params = {
    'openai': {
        'type': 'azure', # openai/azure
        'temperature': '1',
        'deployment': 'gpt4o' # model for openai / deployment for azure
    },
    'data_dir': './tmp/squats'
}

!mkdir -p {config_params['data_dir']}

# This config handles all the bookkeeping, so you need to pass it everywhere.
config = DatagenConfig(**config_params)

## Get a list of search queries to search for videos

In [7]:
from datagen import get_queries
queries = get_queries(
    config=config,
    prompt='I want to find instructional videos about how to do squats.',
    num_queries=2
)
queries

['how to do squats instructional video', 'proper squat technique tutorial']

## Download video information for each query.

We'll get 2 videos for each query.<br>
The same video may be found by multiple queries, so we might get fewer than `num_queries * videos_per_query` videos.<br>
If you want to get all YouTube videos for a query, don't pass the `videos_per_query` parameter.

You can limit the search to videos licensed under Creative Commons (as indicated by YouTube).<br>
This filter isn't directly supported by the search libraries yet, so we search for all videos and filter by license afterwards.<br>
Unfortunately, this usually leaves very few results, so use it with caution.

In [8]:
from datagen import get_video_ids
ids = get_video_ids(queries, config=config, videos_per_query=2, only_creative_commons=False)
ids

  0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 2/2 [00:02<00:00,  1.13s/it]


['gcNh17Ckjgg', 'EbOPpWi4L8s', 'bEv6CCg2BC8', 'xqvCmoLULNY']
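
Because the same video can be returned for more than one query, the merged list has to be deduplicated. Below is a minimal sketch of order-preserving deduplication. The helper name `merge_unique_ids` and the per-query sample results are illustrative assumptions, not part of the SDK.

```python
from typing import Iterable, List


def merge_unique_ids(results_per_query: Iterable[List[str]]) -> List[str]:
    """Flatten per-query results, keeping the first occurrence of each ID."""
    seen = set()
    merged = []
    for ids_for_query in results_per_query:
        for video_id in ids_for_query:
            if video_id not in seen:
                seen.add(video_id)
                merged.append(video_id)
    return merged


# Hypothetical results: the second query returns one video already found by the first.
per_query = [["gcNh17Ckjgg", "EbOPpWi4L8s"], ["EbOPpWi4L8s", "bEv6CCg2BC8"]]
unique_ids = merge_unique_ids(per_query)
print(unique_ids)  # 3 IDs instead of 2 * 2 = 4
```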

In [2]:
import json
with open(config.data_dir / 'video_ids.json') as f:
    ids = json.load(f)

## Download videos and autogenerated subtitles

You can change subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (refer to https://github.com/yt-dlp/yt-dlp).<br>
The SDK currently expects `.mp4` video files, so don't change the video format.
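
As an illustration, an options dictionary that requests English and Spanish subtitles could look like the sketch below. The keys are standard yt-dlp `YoutubeDL` options. How `download_videos` accepts or merges `yt_dlp_opts` (for example as a keyword argument) is an assumption here, so check the SDK source before relying on it.

```python
# Standard yt-dlp option keys; the values are illustrative.
yt_dlp_opts = {
    "writesubtitles": True,          # download uploaded subtitles when available
    "writeautomaticsub": True,       # also download auto-generated subtitles
    "subtitleslangs": ["en", "es"],  # subtitle languages to fetch
    "subtitlesformat": "vtt",        # subtitle file format
}

# Assumed call shape (not confirmed by this notebook):
# download_videos(ids, config, yt_dlp_opts=yt_dlp_opts)
print(sorted(yt_dlp_opts))
```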

In [9]:
from datagen import download_videos
download_videos(ids, config)

[youtube] Extracting URL: https://www.youtube.com/watch?v=gcNh17Ckjgg
[youtube] gcNh17Ckjgg: Downloading webpage


[youtube] gcNh17Ckjgg: Downloading ios player API JSON
[youtube] gcNh17Ckjgg: Downloading tv player API JSON
[youtube] gcNh17Ckjgg: Downloading player bd3293c9
[youtube] gcNh17Ckjgg: Downloading m3u8 information
[info] gcNh17Ckjgg: Downloading subtitles: en
[info] gcNh17Ckjgg: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/gcNh17Ckjgg.en.vtt
[download] Destination: tmp/squats/videos/gcNh17Ckjgg.en.vtt
[download] 100% of   64.83KiB in 00:00:00 at 1.25MiB/s
[download] Destination: tmp/squats/videos/gcNh17Ckjgg.mp4
[download] 100% of   14.15MiB in 00:00:00 at 27.23MiB/s    
[MoveFiles] Moving file "tmp/squats/videos/gcNh17Ckjgg.en.vtt" to "tmp/squats/subs/gcNh17Ckjgg.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=EbOPpWi4L8s
[youtube] EbOPpWi4L8s: Downloading webpage
[youtube] EbOPpWi4L8s: Downloading ios player API JSON
[youtube] EbOPpWi4L8s: Downloading tv player API JSON
[youtube] EbOPpWi4L8s: Downloading m3u8 information
[info] EbOP

## Detect segments from video

We will use the CLIP-based detector (here with a SigLIP model) because it's much faster than gpt4o, but it needs a GPU.
You can also run it on CPU for debugging.

In [3]:
from transformers import AutoProcessor, AutoModel

# Remove .cuda() to run on CPU.
# SigLIP outputs independent per-prompt probabilities, whereas CLIP outputs multiclass (softmax) probabilities.
model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384").cuda()
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
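
SigLIP scores each text prompt independently with a sigmoid, while CLIP-style scoring normalizes across prompts with a softmax. The difference can be seen on made-up logits. This sketch uses plain Python rather than the model itself, and the logit values are invented for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """CLIP-style scoring: prompts compete, probabilities sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits: List[float]) -> List[float]:
    """SigLIP-style scoring: each prompt gets its own independent probability."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Invented image-text logits for one frame and two prompts that both match poorly.
logits = [-4.0, -5.0]
clip_like = softmax(logits)
siglip_like = sigmoid(logits)
print(clip_like)    # sums to 1, so one prompt still looks like a likely match
print(siglip_like)  # both values stay close to 0
```

With a single text prompt, a softmax would always return 1, so an absolute threshold on the probability is only informative with independent, sigmoid-style scores.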

In [5]:
from datagen import detect_segments_clip

detect_segments_clip(
    # video_ids=['KvRK5Owqzgw'],
    text_prompts='a person doing squats', # text to compare against video frames. You can pass a list of texts to use the average distance.
    model=model,
    processor=processor,
    fps_sampling=2, # higher fps gives more granular segment borders and more precise segments, at the cost of speed.
    device='cuda', # 'cpu' for local
    frames_per_batch=100, # 100 frames use about 10GB of GPU RAM, so size the batch to fit your GPU.
    config=config,

    # Parameters for detecting segments from probabilities. The defaults should work well, but you can adjust them if they produce bad results for specific kinds of videos.
    min_prob=0.1, # minimum probability for a frame to count as a match
    max_gap_seconds=1, # longest gap with prob < min_prob that is still allowed inside a segment
    min_segment_seconds=3, # discard segments shorter than this
    smooth_fraction=0.02, # smoothing strength. Raw probabilities are smoothed to reduce fluctuations between frames.
)

  0%|          | 0/4 [00:00<?, ?it/s]

2024-08-05 20:01:11.233330 EbOPpWi4L8s - starting - 193 frames
2024-08-05 20:01:13.141179 EbOPpWi4L8s - batch 0 - starting
2024-08-05 20:01:32.829624 EbOPpWi4L8s - batch 1 - starting


 25%|██▌       | 1/4 [00:39<01:59, 39.98s/it]

2024-08-05 20:01:51.128046 EbOPpWi4L8s - all batches done - detecting segments
2024-08-05 20:01:51.215352 gcNh17Ckjgg - starting - 870 frames
2024-08-05 20:02:02.076371 gcNh17Ckjgg - batch 0 - starting
2024-08-05 20:02:20.961139 gcNh17Ckjgg - batch 1 - starting
2024-08-05 20:02:39.755510 gcNh17Ckjgg - batch 2 - starting
2024-08-05 20:02:58.627160 gcNh17Ckjgg - batch 3 - starting
2024-08-05 20:03:17.502791 gcNh17Ckjgg - batch 4 - starting
2024-08-05 20:03:36.482962 gcNh17Ckjgg - batch 5 - starting
2024-08-05 20:03:55.420918 gcNh17Ckjgg - batch 6 - starting
2024-08-05 20:04:14.296455 gcNh17Ckjgg - batch 7 - starting
2024-08-05 20:04:33.144871 gcNh17Ckjgg - batch 8 - starting


 50%|█████     | 2/4 [03:35<03:59, 119.84s/it]

2024-08-05 20:04:46.854691 gcNh17Ckjgg - all batches done - detecting segments
2024-08-05 20:04:46.951287 bEv6CCg2BC8 - starting - 1318 frames
2024-08-05 20:04:59.902384 bEv6CCg2BC8 - batch 0 - starting
2024-08-05 20:05:18.849763 bEv6CCg2BC8 - batch 1 - starting
2024-08-05 20:05:37.729374 bEv6CCg2BC8 - batch 2 - starting
2024-08-05 20:05:56.567156 bEv6CCg2BC8 - batch 3 - starting
2024-08-05 20:06:15.475718 bEv6CCg2BC8 - batch 4 - starting
2024-08-05 20:06:34.447244 bEv6CCg2BC8 - batch 5 - starting
2024-08-05 20:06:53.330642 bEv6CCg2BC8 - batch 6 - starting
2024-08-05 20:07:12.193763 bEv6CCg2BC8 - batch 7 - starting
2024-08-05 20:07:31.124894 bEv6CCg2BC8 - batch 8 - starting
2024-08-05 20:07:50.040327 bEv6CCg2BC8 - batch 9 - starting
2024-08-05 20:08:08.981399 bEv6CCg2BC8 - batch 10 - starting
2024-08-05 20:08:27.879922 bEv6CCg2BC8 - batch 11 - starting
2024-08-05 20:08:46.790233 bEv6CCg2BC8 - batch 12 - starting
2024-08-05 20:09:05.719565 bEv6CCg2BC8 - batch 13 - starting


 75%|███████▌  | 3/4 [07:58<03:05, 185.01s/it]

2024-08-05 20:09:09.380070 bEv6CCg2BC8 - all batches done - detecting segments
2024-08-05 20:09:09.509757 xqvCmoLULNY - starting - 97 frames
2024-08-05 20:09:10.448225 xqvCmoLULNY - batch 0 - starting


100%|██████████| 4/4 [08:17<00:00, 124.40s/it]

2024-08-05 20:09:28.786903 xqvCmoLULNY - all batches done - detecting segments
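
The number of batches in the log follows from each video's sampled frame count and `frames_per_batch`. Here is a small check of that arithmetic; the helper `num_batches` is illustrative and not an SDK function.

```python
import math


def num_batches(num_frames: int, frames_per_batch: int) -> int:
    """Number of batches needed to process all sampled frames."""
    return math.ceil(num_frames / frames_per_batch)


# Frame counts taken from the log above, with frames_per_batch=100.
frame_counts = {
    "EbOPpWi4L8s": 193,
    "gcNh17Ckjgg": 870,
    "bEv6CCg2BC8": 1318,
    "xqvCmoLULNY": 97,
}
batches = {vid: num_batches(n, 100) for vid, n in frame_counts.items()}
print(batches)
```

The results (2, 9, 14 and 1 batches) match the highest `batch N` index logged for each video plus one.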





For each video we get a list of segments:
```
[
    ...
    {
        "start_timestamp": "00:00:32.500",
        "end_timestamp": "00:00:41.500",
        "fps": 29.97002997002997,
        "segment_info": null, # not used with clip, but could be used with gpt4o
        "video_id": "KvRK5Owqzgw"
    },
    ...
]
```
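
To give an intuition for `min_prob`, `max_gap_seconds` and `min_segment_seconds`, here is a simplified sketch of turning per-frame probabilities into segments. It is not the SDK's implementation: smoothing is omitted, and the function names and sample probabilities are made up for illustration.

```python
from typing import List, Tuple


def to_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, the timestamp format shown above."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def detect_segments(
    probs: List[float],
    fps_sampling: float = 2,
    min_prob: float = 0.1,
    max_gap_seconds: float = 1,
    min_segment_seconds: float = 3,
) -> List[Tuple[str, str]]:
    """Threshold frame probabilities, bridge short gaps, drop short segments."""
    step = 1.0 / fps_sampling  # seconds between sampled frames
    max_gap_frames = int(max_gap_seconds * fps_sampling)
    segments = []  # [first_frame, last_frame] index pairs
    for i, p in enumerate(probs):
        if p < min_prob:
            continue
        if segments and i - segments[-1][1] - 1 <= max_gap_frames:
            segments[-1][1] = i  # close enough to the previous segment: extend it
        else:
            segments.append([i, i])
    return [
        (to_timestamp(start * step), to_timestamp((end + 1) * step))
        for start, end in segments
        if (end + 1 - start) * step >= min_segment_seconds
    ]


# Made-up probabilities sampled at 2 fps: a 1-second dip inside a squat segment,
# followed by a 1-second blip that is too short to keep.
probs = [0.0] * 4 + [0.5] * 4 + [0.05] * 2 + [0.6] * 4 + [0.0] * 6 + [0.4] * 2 + [0.0] * 2
found = detect_segments(probs)
print(found)
```

In this example the dip is bridged because it does not exceed `max_gap_seconds`, and the final blip is discarded because it is shorter than `min_segment_seconds`.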

## Annotation step 1: extract information (clues) from transcript

In [6]:
from datagen import generate_clues

human_prompt = """
The provided video is a tutorial about how to perform squats. 

I need to understand HOW THE PERSON SHOWN IN EACH SEGMENT PERFORMS SQUATS IN THIS SEGMENT.
What is done correctly.
What mistakes they make. Why these mistakes happen.
How these mistakes could be improved.

It is very important that the information you provide describes how the person shown in the segment is doing squats, and is not generic advice that is unrelated to the visual information.
"""

clues = generate_clues(
    # video_ids=['byxWus7BwfQ'],
    config=config,
    human_prompt=human_prompt,
    segments_per_call=5, # the output might be quite long, so we limit the number of segments per gpt call to respect the max output length
    raise_on_error=True, # interrupt when encountering an error. Useful for debugging.
)

  0%|          | 0/4 [00:00<?, ?it/s]

2024-08-05 20:10:34.537595 EbOPpWi4L8s - started
2024-08-05 20:10:34.538707 EbOPpWi4L8s part 0 - started


 25%|██▌       | 1/4 [00:30<01:31, 30.35s/it]

2024-08-05 20:11:04.888972 EbOPpWi4L8s - done
2024-08-05 20:11:04.889714 gcNh17Ckjgg - started
2024-08-05 20:11:04.890555 gcNh17Ckjgg part 0 - started
2024-08-05 20:11:42.035626 gcNh17Ckjgg part 1 - started
2024-08-05 20:12:32.563929 gcNh17Ckjgg part 2 - started


 50%|█████     | 2/4 [03:01<03:22, 101.24s/it]

2024-08-05 20:13:35.748808 gcNh17Ckjgg - done
2024-08-05 20:13:35.749525 bEv6CCg2BC8 - started
2024-08-05 20:13:35.750599 bEv6CCg2BC8 part 0 - started
2024-08-05 20:14:52.292449 bEv6CCg2BC8 part 1 - started
2024-08-05 20:15:33.855359 bEv6CCg2BC8 part 2 - started


 75%|███████▌  | 3/4 [05:37<02:06, 126.25s/it]

2024-08-05 20:16:11.757339 bEv6CCg2BC8 - done
2024-08-05 20:16:11.758068 xqvCmoLULNY - started
2024-08-05 20:16:11.758798 xqvCmoLULNY part 0 - started


100%|██████████| 4/4 [05:54<00:00, 88.54s/it] 

2024-08-05 20:16:28.678385 xqvCmoLULNY - done
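
The `part N` entries in the log correspond to groups of at most `segments_per_call` segments. Here is a sketch of that grouping, assuming simple consecutive slicing. The helper name `chunk` and the segment count of 12 are illustrative; the log only shows that `gcNh17Ckjgg` needed three parts.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk(items: Sequence[T], size: int) -> List[Sequence[T]]:
    """Split items into consecutive parts of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


segments = [f"segment_{i}" for i in range(12)]  # hypothetical segment list
parts = chunk(segments, 5)
print([len(p) for p in parts])
```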





## Annotation step 2: generate structured annotations from the transcript and clues

In [7]:
from datagen import generate_annotations
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This is the information that will be extracted for each segment from the transcript and the data from the previous step.
# This is the most important part for the annotation, and getting good results requires a lot of experimenting.


human_prompt = '''
You are given a JSON object that contains clues about segments of a video with timecodes.
!!!! For each segment provided in a JSON object you need to answer the following questions:
1. Given the data found in the JSON object, what is the probability that this part contains footage of a person doing squats? [the answer could be only "high", "medium", "low", or null (if impossible to infer from the provided data)]
2. Given the data found in the JSON object and even if the answer to the previous question is "low", does this person do squats right, wrong, or mixed? [the answer could be only "right", "wrong", "mixed", or null (if impossible to infer from the provided data)]
3. Given the data found in the JSON object, what exactly does this person do right and/or wrong regarding their squat technique? [the answer should be clear and focused on body parts]
4. If the answer to the previous question contains a description of wrong technique, explain how to fix these mistakes using your "own knowledge" like you are a sports coach.
'''

class SegmentFeedback(BaseModel):
    '''
—> GOOD EXAMPLES:
    "wrong":"Knees caving in: This can stress the knees and reduce effectiveness"
    "correction":"Focus on keeping knees aligned with your toes."
    "wrong":"Rounding the back: This increases the risk of back injuries"
    "correction":"Keep your chest up and maintain a neutral spine throughout the movement."
    "wrong":"Heels are lifting off the ground: this shifts the weight forward, reducing stability"
    "correction":" Keep your weight on your heels and press through them as you rise."
    "right":"Chest and shoulders: The chest is up, and the shoulders are back, maintaining an upright torso."
    "correction":null
—> BAD EXAMPLES:
    "wrong":"knees"
    "correction":"fix knees"
    "wrong":"back looks funny"
    "correction":"make back better"
    "wrong":"feet are doing something"
    "correction":"feet should be different"
    "right":"arms"
    "correction":"arms are fine i think"
—> BAD EXAMPLES END HERE
    '''
    right: Optional[str] = Field(description='what was right in the performance')
    wrong: Optional[str] = Field(description='what was wrong in the performance')
    correction: Optional[str] = Field(description='how and in what ways the performance could be improved')

# The segment timestamps are taken from the provided information.
class SegmentAnnotation(BaseModel):
    squats_probability: Optional[str] = Field(description='how high is the probability that the person is doing squats in the segment: low, medium, high, unknown(null)')
    squats_technique_correctness: Optional[str] = Field(description='correctness of the squat technique.')
    squats_feedback: Optional[SegmentFeedback] = Field(description='what was right and wrong in the squat performance in the segment. When the technique is incorrect, provide instructions on how to correct it.')

# Segments are filtered later, in the aggregation step, based on the `squats_probability` field.
annotations = generate_annotations(
    human_prompt=human_prompt,
    config=config,
    segments_per_call=5,
    annotation_schema=SegmentAnnotation,
)

  0%|          | 0/4 [00:00<?, ?it/s]

2024-08-05 20:19:08.236552 EbOPpWi4L8s - started


 25%|██▌       | 1/4 [00:06<00:18,  6.17s/it]

2024-08-05 20:19:14.411218 EbOPpWi4L8s - done
2024-08-05 20:19:14.411934 gcNh17Ckjgg - started


 50%|█████     | 2/4 [00:48<00:55, 27.70s/it]

2024-08-05 20:19:57.170396 gcNh17Ckjgg - done
2024-08-05 20:19:57.171090 bEv6CCg2BC8 - started
2024-08-05 20:19:58.866695 5 validation errors for VideoAnnotation
segments -> 0 -> segment_annotation
  field required (type=value_error.missing)
segments -> 1 -> segment_annotation
  field required (type=value_error.missing)
segments -> 2 -> segment_annotation
  field required (type=value_error.missing)
segments -> 3 -> segment_annotation
  field required (type=value_error.missing)
segments -> 4 -> segment_annotation
  field required (type=value_error.missing)
2024-08-05 20:19:58.866845 Error while generating annotations for bEv6CCg2BC8 part 0, skipping


 75%|███████▌  | 3/4 [01:03<00:21, 21.49s/it]

2024-08-05 20:20:11.281963 bEv6CCg2BC8 - done
2024-08-05 20:20:11.282664 xqvCmoLULNY - started


100%|██████████| 4/4 [01:06<00:00, 16.59s/it]

2024-08-05 20:20:14.596331 xqvCmoLULNY - done





In [8]:
from datagen import aggregate_annotations

def filter_annotations(ann):
    if ann['squats_probability'] in [None, 'low', 'None', 'null']:
        # if we're not able to infer probability or prob is low, we don't need it
        return False
    if ann['squats_technique_correctness'] in [None, 'null', 'None']:
        # if we couldn't establish correctness at all, the feedback is probably useless
        return False
    if ann['squats_technique_correctness'] in ['mixed']:
        # discard segments with mixed correctness and empty feedback, since there is nothing to use for training
        if ann['squats_feedback'] is None:
            return False
        if set(ann['squats_feedback'].values()) == set([None]):
            return False
    return True

annotations = aggregate_annotations(config, filter_func=filter_annotations, annotation_file='annotations.json')
print('Total segments:', len(annotations))
annotations[0]

100%|██████████| 4/4 [00:00<00:00, 31775.03it/s]

Total segments: 10





{'end_timestamp': '00:00:13.250',
 'segment_annotation': {'squats_feedback': {'correction': None,
   'right': 'Use of chair for support: Good method for beginners to learn proper form without risk. Clear instructions on basic form steps: Ensures proper understanding of essential squat mechanics.',
   'wrong': None},
  'squats_probability': 'high',
  'squats_technique_correctness': 'correct'},
 'start_timestamp': '00:00:04.250',
 'video_id': 'EbOPpWi4L8s',
 'id': 'EbOPpWi4L8s_0',
 'video_path': 'EbOPpWi4L8s_0.mp4'}
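
Before running aggregation on real data, it can help to sanity-check a filter like this on a few hand-written annotations. The sketch below restates the filter in compact form so that it runs on its own. The name `keep_annotation` and the sample annotations are illustrative.

```python
NULL_LIKE = {None, "None", "null"}


def keep_annotation(ann: dict) -> bool:
    """Compact restatement of the filter above, for standalone testing."""
    if ann["squats_probability"] in NULL_LIKE | {"low"}:
        return False
    if ann["squats_technique_correctness"] in NULL_LIKE:
        return False
    if ann["squats_technique_correctness"] == "mixed":
        feedback = ann["squats_feedback"]
        if feedback is None or set(feedback.values()) == {None}:
            return False
    return True


empty_feedback = {"right": None, "wrong": None, "correction": None}
samples = {
    "good": {"squats_probability": "high", "squats_technique_correctness": "correct",
             "squats_feedback": {"right": "Chest is up.", "wrong": None, "correction": None}},
    "low_prob": {"squats_probability": "low", "squats_technique_correctness": "correct",
                 "squats_feedback": empty_feedback},
    "null_string": {"squats_probability": "None", "squats_technique_correctness": "wrong",
                    "squats_feedback": empty_feedback},
    "mixed_empty": {"squats_probability": "high", "squats_technique_correctness": "mixed",
                    "squats_feedback": empty_feedback},
}
kept = [name for name, ann in samples.items() if keep_annotation(ann)]
print(kept)
```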

## Final step: cut video clips for the annotated segments from the original videos

In [9]:
from datagen import cut_videos
cut_videos(config=config)

  0%|          | 0/10 [00:00<?, ?it/s]

100%|██████████| 10/10 [00:27<00:00,  2.78s/it]
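
For reference, cutting a clip by timestamps can be done with ffmpeg. The sketch below only builds the command line and does not run it. Whether `cut_videos` uses ffmpeg this way is an assumption, and the helper name and file paths are illustrative.

```python
from pathlib import Path
from typing import List


def build_cut_command(src: Path, dst: Path, start: str, end: str) -> List[str]:
    """Build an ffmpeg command that re-encodes the [start, end] range of src into dst."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", str(src),  # input video
        "-ss", start,    # clip start, HH:MM:SS.mmm
        "-to", end,      # clip end, HH:MM:SS.mmm
        str(dst),
    ]


cmd = build_cut_command(
    Path("tmp/squats/videos/EbOPpWi4L8s.mp4"),
    Path("tmp/squats/clips/EbOPpWi4L8s_0.mp4"),
    start="00:00:04.250",
    end="00:00:13.250",
)
print(" ".join(cmd))
# To actually cut the clip: subprocess.run(cmd, check=True)
```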


As a result, we generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items with the following fields:
    - `video_id`: 11-character YouTube video ID (youtube.com/watch?v=<id>)
    - `start_timestamp`/`end_timestamp` of the clip relative to the YouTube video it's taken from
    - `video_path` of the clip relative to `<data_dir>/clips/`
    - `segment_annotation` that you can use for training
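
Here is a sketch of how these results might be consumed for training, assuming `annotations.json` is a JSON list of items shaped like the example shown earlier. The helper `load_training_pairs` and the choice of fields to pair are illustrative.

```python
import json
from pathlib import Path
from typing import List, Tuple


def load_training_pairs(data_dir: Path) -> List[Tuple[Path, dict]]:
    """Pair each clip path with its segment annotation."""
    with open(data_dir / "annotations.json") as f:
        items = json.load(f)
    return [
        (data_dir / "clips" / item["video_path"], item["segment_annotation"])
        for item in items
    ]


# Usage (after running the steps above):
# pairs = load_training_pairs(Path("./tmp/squats"))
# clip_path, annotation = pairs[0]
```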