# Getting Started with Data Generation SDK

We are going to generate a dataset of squat videos with instructions on how to perform them, so that we can train an AI personal trainer.

In [1]:
%load_ext autoreload
%autoreload 2

from datagen import DatagenConfig
# This config handles all the bookkeeping, so you need to pass it everywhere.
config = DatagenConfig.from_yaml('./config.yaml')

## Get a list of search queries to search for videos

In [2]:
from datagen import get_queries
queries = get_queries(
    config=config,
    prompt='I want to find instructional videos about how to do squats.',
    num_queries=2
)
queries

['how to do squats instructional video', 'squats exercise tutorial']

## Download video information for each query.

We'll get 2 videos for each query.<br>
One video might be found by multiple queries, so we might get fewer than `num_queries * videos_per_query` videos.<br>
If you want to get all YouTube videos for a query, don't pass the `videos_per_query` parameter.
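The overlap between queries can be pictured with a small sketch. The helper `merge_video_ids` is hypothetical (it is not part of the SDK) and only illustrates an order-preserving merge; the per-query result lists are made up for the illustration.

```python
def merge_video_ids(results_per_query):
    """Flatten per-query result lists, keeping the first occurrence of each ID."""
    seen = set()
    merged = []
    for ids in results_per_query:
        for video_id in ids:
            if video_id not in seen:
                seen.add(video_id)
                merged.append(video_id)
    return merged


# Two queries with two results each, one video shared between them (illustrative data).
per_query = [
    ["YaXPRqUwItQ", "xqvCmoLULNY"],
    ["xqvCmoLULNY", "gcNh17Ckjgg"],
]
merged_ids = merge_video_ids(per_query)
print(merged_ids)  # 3 unique IDs instead of 2 * 2 = 4
```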

You can limit the search to videos licensed under Creative Commons (as indicated by YouTube) with `only_creative_commons=True`.<br>
As this search isn't directly implemented in search libraries yet, we search for all videos and filter by license afterwards.<br>
Unfortunately, this way you will likely get very few results, so use it with caution.
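The search-then-filter approach described above could look roughly like the sketch below. It is not the SDK's implementation: the function `filter_creative_commons`, the `license` key and the metadata values are all assumptions made for illustration.

```python
def filter_creative_commons(videos):
    """Return the IDs of entries whose license text mentions Creative Commons."""
    kept = []
    for video in videos:
        # The license may be missing or None, so normalise it to a string first.
        license_text = (video.get("license") or "").lower()
        if "creative commons" in license_text:
            kept.append(video["id"])
    return kept


# Made-up search results; real metadata may use different keys and values.
search_results = [
    {"id": "video_a", "license": "Creative Commons Attribution license (reuse allowed)"},
    {"id": "video_b", "license": None},
    {"id": "video_c", "license": "Standard YouTube License"},
]
cc_ids = filter_creative_commons(search_results)
print(cc_ids)
```

Because the filter runs after the search, a request for 2 videos per query can shrink to zero results, which is why so few videos tend to survive.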

In [3]:
from datagen import get_video_ids
ids = get_video_ids(queries, config=config, videos_per_query=2, only_creative_commons=False)
ids

100%|██████████| 2/2 [00:02<00:00,  1.06s/it]


['YaXPRqUwItQ', 'xqvCmoLULNY', 'gcNh17Ckjgg']

## Download videos and autogenerated subtitles

You can change subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (refer to https://github.com/yt-dlp/yt-dlp).<br>
The SDK expects `.mp4` video files (for now), so don't change the video format.
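As a rough illustration, an options dictionary might look like the one below. The keys are standard yt-dlp options, but how the dictionary reaches `download_videos` (for example as a `yt_dlp_opts=` keyword argument) is an assumption, so check the SDK signature. The helper `video_format_keys` is hypothetical and only flags options that could change the container.

```python
# Hypothetical options; the keys themselves are standard yt-dlp options.
yt_dlp_opts = {
    "writeautomaticsub": True,  # download auto-generated subtitles
    "subtitleslangs": ["en"],   # subtitle languages to fetch
    "subtitlesformat": "vtt",   # subtitle file format
}

# yt-dlp options that select or convert the video format.
VIDEO_FORMAT_KEYS = {"format", "merge_output_format"}


def video_format_keys(opts):
    """Return the keys in opts that could change the video format."""
    return sorted(VIDEO_FORMAT_KEYS & set(opts))


risky = video_format_keys(yt_dlp_opts)
print(risky)  # empty: these options only touch subtitles
```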

In [2]:
from datagen import download_videos
download_videos(['gcNh17Ckjgg', 'KvRK5Owqzgw', 'xqvCmoLULNY', 'YaXPRqUwItQ'], config)

[youtube] Extracting URL: https://www.youtube.com/watch?v=KvRK5Owqzgw
[youtube] KvRK5Owqzgw: Downloading webpage


[youtube] KvRK5Owqzgw: Downloading ios player API JSON
[youtube] KvRK5Owqzgw: Downloading player 250a2ff7


         n = 6GXUpjC9a5BZO9WmI7aZV2 ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js
         n = 7Q-BIXb6qWlDcOIHMJOyJB ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js


[youtube] KvRK5Owqzgw: Downloading m3u8 information
[info] KvRK5Owqzgw: Downloading subtitles: en
[info] KvRK5Owqzgw: Downloading 1 format(s): 614
[info] Writing video subtitles to: tmp/squats/videos/KvRK5Owqzgw.en.vtt
[download] Destination: tmp/squats/videos/KvRK5Owqzgw.en.vtt
[download] 100% of    6.76KiB in 00:00:00 at 87.55KiB/s
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 9
[download] Destination: tmp/squats/videos/KvRK5Owqzgw.mp4
[download] 100% of    8.11MiB in 00:00:08 at 991.81KiB/s               
[MoveFiles] Moving file "tmp/squats/videos/KvRK5Owqzgw.en.vtt" to "tmp/squats/subs/KvRK5Owqzgw.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=xqvCmoLULNY
[youtube] xqvCmoLULNY: Downloading webpage
[youtube] xqvCmoLULNY: Downloading ios player API JSON


         n = Erw8W7qrJmuVdiP2cLm7d2 ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js
         n = cb9o2LU8eGoNkD4Oaf8DNd ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js


[youtube] xqvCmoLULNY: Downloading m3u8 information
[info] xqvCmoLULNY: Downloading subtitles: en
[info] xqvCmoLULNY: Downloading 1 format(s): 614
[info] Writing video subtitles to: tmp/squats/videos/xqvCmoLULNY.en.vtt
[download] Destination: tmp/squats/videos/xqvCmoLULNY.en.vtt
[download] 100% of    6.00KiB in 00:00:00 at 155.18KiB/s
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 10
[download] Destination: tmp/squats/videos/xqvCmoLULNY.mp4
[download] 100% of    3.13MiB in 00:00:00 at 5.70MiB/s                   
[MoveFiles] Moving file "tmp/squats/videos/xqvCmoLULNY.en.vtt" to "tmp/squats/subs/xqvCmoLULNY.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=gcNh17Ckjgg
[youtube] gcNh17Ckjgg: Downloading webpage
[youtube] gcNh17Ckjgg: Downloading ios player API JSON


         n = AWd31sv8mDohRxttDppj1z ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js
         n = 7XjV9qziow8-RsKbLbg0ll ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js


[youtube] gcNh17Ckjgg: Downloading m3u8 information
[info] gcNh17Ckjgg: Downloading subtitles: en
[info] gcNh17Ckjgg: Downloading 1 format(s): 616
[info] Writing video subtitles to: tmp/squats/videos/gcNh17Ckjgg.en.vtt
[download] Destination: tmp/squats/videos/gcNh17Ckjgg.en.vtt
[download] 100% of   64.83KiB in 00:00:00 at 1.49MiB/s
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 86
[download] Destination: tmp/squats/videos/gcNh17Ckjgg.mp4
[download] 100% of  139.49MiB in 00:00:06 at 20.39MiB/s                 
[MoveFiles] Moving file "tmp/squats/videos/gcNh17Ckjgg.en.vtt" to "tmp/squats/subs/gcNh17Ckjgg.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=YaXPRqUwItQ
[youtube] YaXPRqUwItQ: Downloading webpage
[youtube] YaXPRqUwItQ: Downloading ios player API JSON


         n = BkD33Pi_ilvAYBvL4GsX99 ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js
         n = JAN_VnmXMBEoYPvReeoVve ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js


[youtube] YaXPRqUwItQ: Downloading m3u8 information
[info] YaXPRqUwItQ: Downloading subtitles: en
[info] YaXPRqUwItQ: Downloading 1 format(s): 616
[info] Writing video subtitles to: tmp/squats/videos/YaXPRqUwItQ.en.vtt
[download] Destination: tmp/squats/videos/YaXPRqUwItQ.en.vtt
[download] 100% of   15.69KiB in 00:00:00 at 423.55KiB/s
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 32
[download] Destination: tmp/squats/videos/YaXPRqUwItQ.mp4
[download] 100% of   28.07MiB in 00:00:03 at 7.34MiB/s                  
[MoveFiles] Moving file "tmp/squats/videos/YaXPRqUwItQ.en.vtt" to "tmp/squats/subs/YaXPRqUwItQ.en.vtt"


## Detect segments from video

In [6]:
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384").cuda()
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

In [None]:
from datagen import detect_segments_clip

detect_segments_clip(
    # video_ids=['KvRK5Owqzgw'],
    text_prompts='a person doing squats',
    model=model,
    processor=processor,
    fps_sampling=2,
    device='cuda',
    config=config
)

For each video we get a list of segments:
```
[
    ...
    {
        "start_timestamp": "00:00:32.500",
        "end_timestamp": "00:00:41.500",
        "fps": 29.97002997002997,
        "segment_info": null,
        "video_id": "KvRK5Owqzgw"
    },
    ...
]
```
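To give an intuition for how sampled frames can turn into segments like the one above, here is a minimal sketch. It assumes each sampled frame already has a text-image similarity score and groups consecutive frames above a threshold. The function names, the threshold and the scores are made up, and the SDK may work differently.

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm, matching the segment output above."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def scores_to_segments(scores, fps_sampling, threshold=0.5, min_frames=2):
    """Group consecutive sampled frames whose score reaches the threshold."""
    segments = []
    start = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_frames:  # drop runs that are too short
                segments.append({
                    "start_timestamp": format_timestamp(start / fps_sampling),
                    # The end is the time of the first frame below the threshold.
                    "end_timestamp": format_timestamp(i / fps_sampling),
                })
            start = None
    return segments


# Made-up per-frame scores, sampled at 2 frames per second.
frame_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.9, 0.2]
segments = scores_to_segments(frame_scores, fps_sampling=2)
print(segments)
```

With `fps_sampling=2`, boundaries fall on half-second steps, which is consistent with timestamps such as `00:00:32.500` in the output above.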

## Annotation step 1: extract information from transcript

In [2]:
from datagen.clues import generate_clues_dataclass
SegmentAnnotation = generate_clues_dataclass(prompt='making plov', config=config)

from pprint import pprint
pprint(SegmentAnnotation.schema())

pydantic.v1.main.SegmentAnnotation

In [11]:
from datagen import generate_clues

clues = generate_clues(
    config=config,
    annotation_schema=SegmentAnnotation,
    # human_prompt=human_prompt,
    segments_per_call=5,
    raise_on_error=True, # interrupt when encountering an error. Useful for debugging.
)

  0%|          | 0/1 [00:00<?, ?it/s]

KvRK5Owqzgw - started
KvRK5Owqzgw part 0 - started


100%|██████████| 1/1 [00:08<00:00,  8.07s/it]

KvRK5Owqzgw - done





## Annotation step 2: generate annotations from the transcript and clues

In [4]:
from datagen import generate_annotations
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This is the information that will be extracted for each segment from the transcript and the data from the previous step.
# It is the most important part of the annotation, and getting good results requires a lot of experimentation.

human_prompt = '''
You are given a JSON object that contains clues about segments of a video with timecodes.
!!!! For each segment provided in the JSON object you need to answer the following questions:
1. Given the data found in the JSON object, what is the probability that this part contains footage of a person doing squats? [the answer can only be "High", "Medium" or "Low"]
2. Given the data found in the JSON object, and even if the answer to the previous question is "Low", does this person do squats right, wrong, or mixed? [the answer can only be "Right", "Wrong" or "Mixed"]
3. Given the data found in the JSON object, what exactly does this person do right and/or wrong regarding their squat technique? [the answer should be clear and focused on body parts]
4. If the answer to the previous question contains a description of wrong technique, explain how to fix these mistakes using your "own knowledge" as if you were a sports coach.
'''

class SegmentFeedback(BaseModel):
    '''
    You are a fitness trainer giving feedback on what was right, wrong, and what could be improved.
    Talk as you would talk to a trainee, but avoid excessive language or irrelevant banter.

—> GOOD EXAMPLES:
    "wrong":"Knees caving in: This can stress the knees and reduce effectiveness"
    "correction":"Focus on keeping knees aligned with your toes."
    "wrong":"Rounding the back: This increases the risk of back injuries"
    "correction":"Keep your chest up and maintain a neutral spine throughout the movement."
    "wrong":"Heels are lifting off the ground: this shifts the weight forward, reducing stability"
    "correction":" Keep your weight on your heels and press through them as you rise."
    "right":"Chest and shoulders: The chest is up, and the shoulders are back, maintaining an upright torso."
    "correction":null
—> BAD EXAMPLES:
    "wrong":"knees"
    "correction":"fix knees"
    "wrong":"back looks funny"
    "correction":"make back better"
    "wrong":"feet are doing something"
    "correction":"feet should be different"
    "right":"arms"
    "correction":"arms are fine i think"
—> BAD EXAMPLES END HERE
    '''
    right: Optional[str] = Field(description='what was right in the performance')
    wrong: Optional[str] = Field(description='what was wrong in the performance')
    correction: Optional[str] = Field(description='how and in what ways the performance could be improved')

# The segment timestamps are taken from the provided information.
class SegmentAnnotation(BaseModel):
    '''
    This annotation is generated exclusively from the provided information about this specific segment.
    Don't pay attention to information about other segments.
    '''
    # squats_probability: Optional[str] = Field(description='how high is the probability that the person is doing squats in the segment: low, medium, high, unknown(null)')
    squats_technique_correctness: Optional[bool] = Field(description='boolean correctness of the squat technique.')
    squats_feedback: SegmentFeedback = Field(description='what was right and wrong in the squat performance in the segment. When the technique is incorrect, provide instructions on how to correct it.')

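# Optionally, `filter_by` keeps only the segments where the named field is positive (e.g. a "doing_squats" field).
# It is disabled here because the schema above has no such field.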
annotations = generate_annotations(
    human_prompt=human_prompt,
    config=config,
    annotation_schema=SegmentAnnotation,
    # filter_by='doing_squats'
)

  0%|          | 0/1 [00:00<?, ?it/s]

KvRK5Owqzgw - started


100%|██████████| 1/1 [00:03<00:00,  3.83s/it]

KvRK5Owqzgw - done





Now we get a list of annotations for each video:
```
[
    {
        "start_timestamp": "00:00:51.760",
        "end_timestamp": "00:01:01.520",
        "segment_annotation": {
            "correct": null,
            "incorrect_reasons": null,
            "qa": [
                {
                    "question": "Was there important advice about performing the exercise correctly?",
                    "answer": "Yes, the advice was to make sure the knees do not go forward of the toes.",
                    "quote": "making sure that your knees do not go forward of your toes"
                }
            ]
        }
    },
    ...
]
```

In [2]:
from datagen import aggregate_annotations

# saved to annotations.json
annotations = aggregate_annotations(config)
print('Total segments:', len(annotations))
annotations[0]

skipping gcNh17Ckjgg
Total segments: 22


{'start_timestamp': '00:00:20.479',
 'end_timestamp': '00:00:26.485',
 'segment_annotation': {'correct': None,
  'incorrect_reasons': None,
  'qa': [{'question': 'Was the exercise (squat) performed correctly?',
    'answer': 'Yes, the squat exercise was described correctly.',
    'quote': "let's learn how to properly perform a squat...cross your arms in front...shift your weight to the ball of your feet...bend your knees...push back up to the starting position."}]},
 'video_id': 'xqvCmoLULNY',
 'id': 'xqvCmoLULNY_0',
 'video_path': 'xqvCmoLULNY_0.mp4'}

## The last step is to cut video clips for annotated segments from original videos

In [3]:
from datagen import cut_videos
cut_videos(config=config)

100%|██████████| 22/22 [00:14<00:00,  1.55it/s]
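For intuition, cutting one segment out of a downloaded video can be expressed as an ffmpeg command. The sketch below only builds the command and does not run it. The helper `build_cut_command` is hypothetical, the `tmp/squats` data directory is inferred from the download logs, and this is not necessarily how `cut_videos` works internally.

```python
def build_cut_command(video_path, start_timestamp, end_timestamp, clip_path):
    """Build an ffmpeg command that writes the [start, end] range to clip_path."""
    return [
        "ffmpeg",
        "-i", video_path,        # source video
        "-ss", start_timestamp,  # start of the clip
        "-to", end_timestamp,    # end of the clip
        clip_path,
    ]


# Segment values taken from the detection output shown earlier.
command = build_cut_command(
    "tmp/squats/videos/KvRK5Owqzgw.mp4",
    "00:00:32.500",
    "00:00:41.500",
    "tmp/squats/clips/KvRK5Owqzgw_0.mp4",
)
print(" ".join(command))
```

To actually run such a command, you could pass the list to `subprocess.run(command, check=True)`, provided ffmpeg is installed.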


As a result, we generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items with the fields:
    - `video_id`: 11-character YouTube video ID (youtube.com/watch?v=<id>)
    - `start_timestamp`/`end_timestamp` of the clip relative to the YouTube video it's taken from
    - `video_path` of the clip relative to `<data_dir>/clips/`
    - `segment_annotation` that you can use for training
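A minimal sketch of how these files could be paired up for training is shown below. The helpers `load_annotations` and `to_training_pairs` are hypothetical, and the `tmp/squats` data directory is an assumption based on the paths in the download logs.

```python
import json
from pathlib import Path


def load_annotations(data_dir):
    """Read <data_dir>/annotations.json into a list of dictionaries."""
    return json.loads((Path(data_dir) / "annotations.json").read_text())


def to_training_pairs(annotations, data_dir):
    """Pair each clip path with its annotation."""
    clips_dir = Path(data_dir) / "clips"
    return [
        (clips_dir / item["video_path"], item["segment_annotation"])
        for item in annotations
    ]


# One item shaped like the aggregate_annotations output above (annotation shortened).
sample = [{
    "start_timestamp": "00:00:20.479",
    "end_timestamp": "00:00:26.485",
    "segment_annotation": {"correct": None, "incorrect_reasons": None, "qa": []},
    "video_id": "xqvCmoLULNY",
    "id": "xqvCmoLULNY_0",
    "video_path": "xqvCmoLULNY_0.mp4",
}]
pairs = to_training_pairs(sample, "tmp/squats")
print(pairs[0][0])
```

In a real run you would replace `sample` with `load_annotations("tmp/squats")`, or whatever data directory your config points to.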