# Getting Started with Data Generation SDK

We are going to generate a dataset of squat videos, each paired with instructions on how to perform the exercise, so that we can train an AI personal trainer.

In [1]:
%load_ext autoreload
%autoreload 2

from datagen import DatagenConfig
# This config handles all the bookkeeping, so you need to pass it to every SDK call.
config = DatagenConfig.from_yaml('./config.yaml')

## Get a list of search queries to search for videos

In [2]:
from datagen import get_queries
queries = get_queries(
    config=config,
    prompt='I want to find instructional videos about how to do squats.',
    num_queries=2
)
queries

['how to do squats instructional video', 'squats exercise tutorial']

## Download video information for each query

We'll get 2 videos for each query.<br>
The same video may be returned by several queries, so we may end up with fewer than `num_queries * videos_per_query` videos.<br>
To get all YouTube videos for a query, omit the `videos_per_query` parameter.
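
To see why the total can be smaller than the product, here is a minimal sketch of merging per-query results while dropping repeats. The query names and video ids are made up, and `merge_unique` is a hypothetical helper, not part of the SDK.

```python
# Hypothetical search results: each query returns its own list of video ids.
results_per_query = {
    "query one": ["vid_a", "vid_b"],
    "query two": ["vid_b", "vid_c"],
}

def merge_unique(results):
    """Flatten per-query results, keeping the first occurrence of each id."""
    seen = {}  # dicts preserve insertion order, so this also keeps ordering
    for ids in results.values():
        for video_id in ids:
            seen.setdefault(video_id, None)
    return list(seen)

unique_ids = merge_unique(results_per_query)
print(unique_ids)  # ['vid_a', 'vid_b', 'vid_c']
```

Two queries with two videos each yield three unique ids here, because `vid_b` appears in both result lists.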

You can limit the search to videos licensed under Creative Commons (as indicated by YouTube).<br>
The search libraries don't support this filter directly yet, so we search for all videos and filter by license afterwards.<br>
Unfortunately, this usually leaves very few results, so use it with caution.
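
The "search first, filter afterwards" approach could look roughly like the sketch below. It assumes each video comes with a metadata record that has a `license` text field (yt-dlp reports such a field for YouTube videos); the records and the `is_creative_commons` helper are hypothetical and only illustrate the idea, not how the SDK does it.

```python
# Hypothetical metadata records; the license strings are illustrative.
videos = [
    {"id": "vid_a", "license": "Creative Commons Attribution license (reuse allowed)"},
    {"id": "vid_b", "license": None},
    {"id": "vid_c", "license": "Standard YouTube License"},
]

def is_creative_commons(video):
    """Treat a video as CC-licensed if its license text mentions Creative Commons."""
    license_text = video.get("license") or ""
    return "creative commons" in license_text.lower()

cc_ids = [video["id"] for video in videos if is_creative_commons(video)]
print(cc_ids)  # ['vid_a']
```

Because the filter runs after the search, only the videos that happen to be in the result list can pass it, which is why so few results survive.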

In [3]:
from datagen import get_video_ids
ids = get_video_ids(queries, config=config, videos_per_query=2, only_creative_commons=False)
ids

100%|██████████| 2/2 [00:02<00:00,  1.06s/it]


['YaXPRqUwItQ', 'xqvCmoLULNY', 'gcNh17Ckjgg']

## Download videos and autogenerated subtitles

You can change subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (refer to https://github.com/yt-dlp/yt-dlp).<br>
The SDK expects `.mp4` video files (for now), so don't change the video format.
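
As an illustration, a `yt_dlp_opts` dictionary that requests auto-generated subtitles in two languages might look like this. The keys are standard yt-dlp options; how the SDK merges them with its own defaults, and the exact keyword used to pass them, are assumptions here.

```python
# Options understood by yt-dlp's Python API.
yt_dlp_opts = {
    "writeautomaticsub": True,       # download auto-generated subtitles
    "subtitleslangs": ["en", "de"],  # subtitle languages to fetch
    "subtitlesformat": "vtt",        # subtitle file format
}

# Assumed call shape, based on the parameter name mentioned above:
# download_videos(ids, config, yt_dlp_opts=yt_dlp_opts)
```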

In [2]:
from datagen import download_videos
download_videos(['gcNh17Ckjgg', 'KvRK5Owqzgw', 'xqvCmoLULNY', 'YaXPRqUwItQ'], config)

[youtube] Extracting URL: https://www.youtube.com/watch?v=KvRK5Owqzgw
[youtube] KvRK5Owqzgw: Downloading webpage


[youtube] KvRK5Owqzgw: Downloading ios player API JSON
[youtube] KvRK5Owqzgw: Downloading player 250a2ff7


         n = 6GXUpjC9a5BZO9WmI7aZV2 ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js
         n = 7Q-BIXb6qWlDcOIHMJOyJB ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js


[youtube] KvRK5Owqzgw: Downloading m3u8 information
[info] KvRK5Owqzgw: Downloading subtitles: en
[info] KvRK5Owqzgw: Downloading 1 format(s): 614
[info] Writing video subtitles to: tmp/squats/videos/KvRK5Owqzgw.en.vtt
[download] Destination: tmp/squats/videos/KvRK5Owqzgw.en.vtt
[download] 100% of    6.76KiB in 00:00:00 at 87.55KiB/s
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 9
[download] Destination: tmp/squats/videos/KvRK5Owqzgw.mp4
[download] 100% of    8.11MiB in 00:00:08 at 991.81KiB/s               
[MoveFiles] Moving file "tmp/squats/videos/KvRK5Owqzgw.en.vtt" to "tmp/squats/subs/KvRK5Owqzgw.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=xqvCmoLULNY
[youtube] xqvCmoLULNY: Downloading webpage
[youtube] xqvCmoLULNY: Downloading ios player API JSON


         n = Erw8W7qrJmuVdiP2cLm7d2 ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js
         n = cb9o2LU8eGoNkD4Oaf8DNd ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js


[youtube] xqvCmoLULNY: Downloading m3u8 information
[info] xqvCmoLULNY: Downloading subtitles: en
[info] xqvCmoLULNY: Downloading 1 format(s): 614
[info] Writing video subtitles to: tmp/squats/videos/xqvCmoLULNY.en.vtt
[download] Destination: tmp/squats/videos/xqvCmoLULNY.en.vtt
[download] 100% of    6.00KiB in 00:00:00 at 155.18KiB/s
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 10
[download] Destination: tmp/squats/videos/xqvCmoLULNY.mp4
[download] 100% of    3.13MiB in 00:00:00 at 5.70MiB/s                   
[MoveFiles] Moving file "tmp/squats/videos/xqvCmoLULNY.en.vtt" to "tmp/squats/subs/xqvCmoLULNY.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=gcNh17Ckjgg
[youtube] gcNh17Ckjgg: Downloading webpage
[youtube] gcNh17Ckjgg: Downloading ios player API JSON


         n = AWd31sv8mDohRxttDppj1z ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js
         n = 7XjV9qziow8-RsKbLbg0ll ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js


[youtube] gcNh17Ckjgg: Downloading m3u8 information
[info] gcNh17Ckjgg: Downloading subtitles: en
[info] gcNh17Ckjgg: Downloading 1 format(s): 616
[info] Writing video subtitles to: tmp/squats/videos/gcNh17Ckjgg.en.vtt
[download] Destination: tmp/squats/videos/gcNh17Ckjgg.en.vtt
[download] 100% of   64.83KiB in 00:00:00 at 1.49MiB/s
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 86
[download] Destination: tmp/squats/videos/gcNh17Ckjgg.mp4
[download] 100% of  139.49MiB in 00:00:06 at 20.39MiB/s                 
[MoveFiles] Moving file "tmp/squats/videos/gcNh17Ckjgg.en.vtt" to "tmp/squats/subs/gcNh17Ckjgg.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=YaXPRqUwItQ
[youtube] YaXPRqUwItQ: Downloading webpage
[youtube] YaXPRqUwItQ: Downloading ios player API JSON


         n = BkD33Pi_ilvAYBvL4GsX99 ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js
         n = JAN_VnmXMBEoYPvReeoVve ; player = https://www.youtube.com/s/player/250a2ff7/player_ias.vflset/en_US/base.js


[youtube] YaXPRqUwItQ: Downloading m3u8 information
[info] YaXPRqUwItQ: Downloading subtitles: en
[info] YaXPRqUwItQ: Downloading 1 format(s): 616
[info] Writing video subtitles to: tmp/squats/videos/YaXPRqUwItQ.en.vtt
[download] Destination: tmp/squats/videos/YaXPRqUwItQ.en.vtt
[download] 100% of   15.69KiB in 00:00:00 at 423.55KiB/s
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 32
[download] Destination: tmp/squats/videos/YaXPRqUwItQ.mp4
[download] 100% of   28.07MiB in 00:00:03 at 7.34MiB/s                  
[MoveFiles] Moving file "tmp/squats/videos/YaXPRqUwItQ.en.vtt" to "tmp/squats/subs/YaXPRqUwItQ.en.vtt"


## Detect segments in the videos and analyze them with GPT-4o

In [6]:
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384").cuda()
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

In [None]:
from datagen import detect_segments_clip

from typing import Optional
from langchain.pydantic_v1 import BaseModel, Field

# This is the schema that we will extract from each detected segment.
# "doing_squats" will be used for filtering and "overlay_text" for annotation.

class SegmentInfo(BaseModel):
    '''Information about a segment'''
    doing_squats: bool = Field(description='Whether the person is doing squats. Only consider video of people, not renders or cartoons. If a person looks like they are preparing to do squats or standing between reps, consider them also doing squats if they are in a gym setting, wearing sportswear etc.')
    # overlay_text: str = Field(description='Overlay text that is superimposed over the image, if present.')

detect_segments_clip(
    # segment_info_schema=SegmentInfo,
    # video_ids=['KvRK5Owqzgw'],
    text_prompts='a person doing squats',
    model=model,
    processor=processor,
    fps_sampling=2,
    device='cuda',
    config=config
)
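
The SDK's internals are not shown in this notebook. As a rough illustration of the general idea only, the sketch below turns per-frame similarity scores (one score per frame sampled at `fps_sampling` frames per second) into time segments by thresholding. The function name, the threshold, the minimum length, and the scores are all hypothetical.

```python
def group_frames(scores, fps_sampling, threshold=0.5, min_frames=2):
    """Group consecutive sampled frames whose score passes the threshold.

    Returns a list of (start_seconds, end_seconds) tuples.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run of matching frames begins
        elif start is not None:
            if i - start >= min_frames:
                segments.append((start / fps_sampling, i / fps_sampling))
            start = None
    # Close a run that reaches the end of the video.
    if start is not None and len(scores) - start >= min_frames:
        segments.append((start / fps_sampling, len(scores) / fps_sampling))
    return segments

# Made-up similarity scores for 10 frames sampled at 2 frames per second.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.2, 0.9, 0.9]
detected = group_frames(scores, fps_sampling=2)
print(detected)  # [(1.0, 2.5), (4.0, 5.0)]
```

The single matching frame at index 6 is dropped because it is shorter than `min_frames`, which is one simple way to suppress flicker in the scores.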

For each video we get a list of segments:
```
[
    ...
    {
        "start_timestamp": "00:00:31.198",
        "end_timestamp": "00:00:36.003",
        "fps": 29.97002997002997,
        "segment_info": {
            "doing_squats": true,
            "overlay_text": "HIP-WIDTH APART"
        },
        "video_id": "gcNh17Ckjgg"
    },
    ...
]
```
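
A small sketch of working with this structure: it parses the `HH:MM:SS.mmm` timestamps, keeps only the segments flagged with `doing_squats`, and computes their durations. The first segment is the one shown above; the second is made up to demonstrate the filter, and `to_seconds` is a hypothetical helper.

```python
segments = [
    {
        "start_timestamp": "00:00:31.198",
        "end_timestamp": "00:00:36.003",
        "segment_info": {"doing_squats": True, "overlay_text": "HIP-WIDTH APART"},
        "video_id": "gcNh17Ckjgg",
    },
    {   # hypothetical segment, added only to show the filter
        "start_timestamp": "00:00:40.000",
        "end_timestamp": "00:00:42.500",
        "segment_info": {"doing_squats": False, "overlay_text": ""},
        "video_id": "gcNh17Ckjgg",
    },
]

def to_seconds(timestamp):
    """Convert 'HH:MM:SS.mmm' to seconds as a float."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

squat_segments = [s for s in segments if s["segment_info"]["doing_squats"]]
durations = [
    round(to_seconds(s["end_timestamp"]) - to_seconds(s["start_timestamp"]), 3)
    for s in squat_segments
]
print(durations)  # [4.805]
```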

## Annotate the segments from transcript + additional info
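
The schema in the next cell separates local clues (located inside the segment or overlapping it) from global clues (found anywhere else in the video). In the SDK this decision is left to the language model; the sketch below only makes the rule concrete for point-in-time clues. The segment boundaries and the `clue_scope` helper are hypothetical.

```python
def to_seconds(timestamp):
    """Convert 'HH:MM:SS' (optionally with fractional seconds) to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def clue_scope(clue_timestamp, segment_start, segment_end):
    """Return 'local' if the clue falls inside the segment, else 'global'."""
    t = to_seconds(clue_timestamp)
    inside = to_seconds(segment_start) <= t <= to_seconds(segment_end)
    return "local" if inside else "global"

# Hypothetical segment boundaries.
segment = ("00:00:19", "00:00:27")
scopes = {ts: clue_scope(ts, *segment) for ts in ["00:00:08", "00:00:22", "00:01:01"]}
print(scopes)  # {'00:00:08': 'global', '00:00:22': 'local', '00:01:01': 'global'}
```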

In [19]:
from datagen.annotate import generate_annotations, generate_clues
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

human_prompt = """User's instructions:
The initial video was a tutorial about how to perform squats. 
All *parts* below contain video footage of a person doing squats.
I need to find as much data as possible about HOW THIS PERSON PERFORMS SQUATS.
I'm interested in how the person in a segment does squats. What mistakes they make. What improvements they show.
What they do correctly. What could be improved.
Please, help me find relevant clues.
"""

# The technique feedback should be in the form of a fitness instructor speaking to a trainee.
# The feedback should be exactly to the exercise performance shown in the segment and not to other segments.
# Good technique feedback:
# - Ensure your knees are in line with your toes and your back is straight while squatting.
# Bad technique feedback:
# - The video emphasizes the importance of following four easy steps to improve squat performance and muscle growth.

class Clue(BaseModel):
    '''
        Good local clues examples: [
      {
        "id": "LC1",
        "timestamp": "00:00:19",
        "text": "exercises do them wrong and instead of",
        "analysis": "This phrase introduces the concept of incorrect exercise form, setting the stage for a demonstration of improper technique."
      },
      {
        "id": "LC2",
        "timestamp": "00:00:21",
        "text": "growing nice quads and glutes you'll",
        "analysis": "Mentions the expected benefits of proper squats (muscle growth), implying that these benefits won't be achieved with incorrect form."
      },
      {
        "id": "LC3",
        "timestamp": "00:00:22",
        "text": "feel aches and pains in your knees your",
        "analysis": "Directly states negative consequences of improper form, strongly suggesting that this segment demonstrates incorrect technique."
      },
      {
        "id": "LC4",
        "timestamp": "00:00:24",
        "text": "lower back and even your shoulders",
        "analysis": "Continuation of LC3, emphasizing multiple areas of potential pain from improper form."
      },
      {
        "id": "LC5",
        "timestamp": "00:00:26",
        "text": "let's see how to do it correctly",
        "analysis": "This phrase suggests a transition is about to occur. The incorrect form has been shown, and correct form will follow."
      }
    ]
    Good global clues examples: [
      {
        "id": "GC1",
        "timestamp": "00:00:08",
        "text": "the most common mistake",
        "analysis": "Introduces the idea that a frequent error will be discussed. This sets up the expectation for a demonstration of this mistake."
      },
      {
        "id": "GC2",
        "timestamp": "00:00:10",
        "text": "is when your heels are",
        "analysis": "Begins to describe the specific error related to heel position."
      },
      {
        "id": "GC3",
        "timestamp": "00:00:12",
        "text": "in the air and not attached to the ground",
        "analysis": "Completes the description of the common mistake. This strongly suggests that the segment will demonstrate this specific error."
      },
      {
        "id": "GC4",
        "timestamp": "00:01:01",
        "text": "butt wink is a problem",
        "analysis": "Introduces another potential issue in squat form. While this comes after the segment, it might be relevant if the demonstration includes multiple errors."
      },
      {
        "id": "GC5",
        "timestamp": "00:01:03",
        "text": "it can lead to the back pain",
        "analysis": "Connects to LC3 and LC4, which mention back pain. This strengthens the possibility that 'butt wink' is also demonstrated in the segment."
      },
      {
        "id": "GC6",
        "timestamp": "00:01:06",
        "text": "so don't do that",
        "analysis": "Reinforces that the previously mentioned 'butt wink' is an error to be avoided, consistent with the segment's focus on incorrect form."
      }
    ]
    '''
    id: str = Field(description='LC1,LC2... for local clues, GC1,GC2... for global clues')
    timestamp: str = Field(description='mandatory for local and global clues, optional for logical inference or additional observations')
    text: str = Field(description='the text taken from the transcript')
    analysis: str = Field(description='interpretation of the text for improving squat technique')

class AdditionalInformation(BaseModel):
    '''
    Good logical inference examples:
    [
      {
        "id": "LI1",
        "description": "Primary Demonstration of Heel Lift",
        "details": "Given that GC1-GC3 describe the 'most common mistake' as heels lifting off the ground, and this description immediately precedes our segment, it's highly probable that this is the primary error being demonstrated. This is further supported by the segment's focus on incorrect form (LC1-LC4)."
      },
      {
        "id": "LI2",
        "description": "Multiple Error Demonstration",
        "details": "While heel lift is likely the primary focus, the mention of multiple pain points (knees, lower back, shoulders in LC3-LC4) suggests that the demonstrator may be exhibiting several forms of incorrect technique simultaneously. This comprehensive 'what not to do' approach would be pedagogically effective."
      },
      {
        "id": "LI3",
        "description": "Possible Inclusion of 'Butt Wink'",
        "details": "Although 'butt wink' is mentioned after our segment (GC4-GC6), its connection to back pain (which is mentioned in LC4) raises the possibility that this error is also present in the demonstration. The instructor may be showing multiple errors early on, then breaking them down individually later."
      },
      {
        "id": "LI4",
        "description": "Segment Placement in Overall Video Structure",
        "details": "The segment's position (starting at 00:00:19) and the phrase 'let's see how to do it correctly' (LC5) at the end suggest this is an early, foundational part of the video. It likely serves to grab attention by showing common mistakes before transitioning to proper form instruction."
      },
      {
        "id": "LI5",
        "description": "Intentional Exaggeration of Errors",
        "details": "Given the educational nature of the video, it's plausible that the demonstrator is intentionally exaggerating the incorrect form. This would make the errors more obvious to viewers and enhance the contrast with correct form shown later."
      }
    ]
    
    Good additional observations examples: [
      {
        "id": "AO1",
        "description": "Absence of Technical Terms",
        "details": "The transcript uses lay terms ('nice quads and glutes') rather than technical anatomical language. This suggests the video is targeted at a general audience rather than fitness professionals."
      },
      {
        "id": "AO2",
        "description": "Emphasis on Consequences",
        "details": "The immediate focus on negative outcomes (pain, lack of muscle growth) indicates a motivational approach, likely to encourage viewers to pay close attention to form."
      },
      {
        "id": "AO3",
        "description": "Potential Visual Cues",
        "details": "While we can't see the video, the specific mentions of body parts (heels, knees, lower back, shoulders) suggest there may be visual indicators or graphics highlighting these areas during the demonstration."
      },
      {
        "id": "AO4",
        "description": "Instructional Flow",
        "details": "The structure (common mistake → demonstration of errors → transition to correct form) follows a classic 'what not to do, then what to do' instructional pattern, which is effective for physical skills."
      }
    ]
    '''
    id: str = Field(description='LI1,LI2,... for logical inference, AO1,AO2,... for additional observations.')
    description: str = Field(description='A concise name of the information')
    details: str = Field(description='a more verbose description related to improving squat technique')

class SegmentAnnotation(BaseModel):
    local_clues: Optional[list[Clue]] = Field(description='Provide here all the clues about this time segment. Explain your logic in “If A then B” style. E.g., "Dan says Tony was doing squats right while Mary did it wrong, and according to the conversation the person in this segment is Tony". The clue is considered local if it is located inside the segment or is overlapping with it. Be excessive, provide all the information you have found. Provide specific instructions from the transcript with timecodes.')
    global_clues: Optional[list[Clue]] = Field(description='Relevant clues are also scattered across the entire video. Provide here all the global clues about this time segment. "Global" means these clues were found across the entire video. E.g., the segment happens at 00:00:15 and the clue was found at 01:19:11. Explain your logic, especially why these clues are relevant to this particular segment. Be excessive, provide all the information you have found. Provide specific instructions from the transcript with timecodes.')
    # 
    # specifically the clues that you have extracted
    # techique_feedback: Optional[str] = Field(description='You are a fitness instructor. You are watching the squats performance of a person and saying them feedback on how good they are doing and what they can do better. The instructions for each segments should be self contained and not referencing other segments. If you dont have enough information to generate these instructions, do not say anything instead of saying that you dont know. Double check yourself to make sure that the feedback from the transcript corresponds to the timestamps of the segment.')
    logical_inferences: Optional[list[AdditionalInformation]] = Field(description='Build logical inferences for clues you found before. Use technical language. Be clear and consistent.')
    additional_observations: Optional[list[AdditionalInformation]] = Field(description='Any other observations that could help interpret the part of the video.')

    # instructions: Optional[str] = Field(description='After extracting clues, generate instructions using them. The instructions should be in the form of feedback to what is happening in the video - whether the squats are performed correctly, what is correct in the form and what could be improved. If no such instructions could be generated, skip the segment and do not output any instructions. They should be worded in the way a fitness coach would provide them. This text should read as though a coach is speaking, it should not contain anything that a person wouldnt say, eg "no instructions are provided here". If there is no clues or its impossible to generate instructions, skip the segment, and do not write anything.')
    # correct_technique: Optional[bool] = Field(description='based on the provided transcript, infer whether the person in the segment performs the exercise correctly or incorrectly')


    # clues_critic: Optional[str] = Field(description='Criticize all clues here.')
    # segment_data: Optional[str] = Field(description='Provide the final segment data here: details about this person body, mistakes and correct things about the squat, observations, recommendations, potential improvements, etc.')



# we will only take the segments where the "doing_squats" field is positive.
clues = generate_clues(
    config=config,
    annotation_schema=SegmentAnnotation,
    human_prompt=human_prompt,
    segments_per_call=5,
    raise_on_error=True
)

  0%|          | 0/4 [00:00<?, ?it/s]

gcNh17Ckjgg - started
gcNh17Ckjgg part 0 - started


gcNh17Ckjgg part 1 - started
gcNh17Ckjgg part 2 - started


 25%|██▌       | 1/4 [01:39<04:58, 99.43s/it]

gcNh17Ckjgg - done
YaXPRqUwItQ - started
YaXPRqUwItQ part 0 - started
YaXPRqUwItQ part 1 - started


 50%|█████     | 2/4 [02:05<01:52, 56.42s/it]

YaXPRqUwItQ - done
xqvCmoLULNY - started
xqvCmoLULNY part 0 - started


 75%|███████▌  | 3/4 [02:24<00:39, 39.39s/it]

xqvCmoLULNY - done
KvRK5Owqzgw - started
KvRK5Owqzgw part 0 - started


100%|██████████| 4/4 [02:32<00:00, 38.18s/it]

KvRK5Owqzgw - done
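
The "part N" lines in the log above appear to correspond to batches of at most `segments_per_call` segments sent to the model in one call. That reading is an assumption; the sketch below only shows what such batching looks like, with a hypothetical helper and made-up segment names.

```python
def split_into_parts(segments, segments_per_call):
    """Split a list of segments into consecutive batches of a fixed maximum size."""
    return [
        segments[i:i + segments_per_call]
        for i in range(0, len(segments), segments_per_call)
    ]

# Twelve made-up segments, batched five at a time.
example_segments = [f"segment_{i}" for i in range(12)]
parts = split_into_parts(example_segments, segments_per_call=5)
part_sizes = [len(part) for part in parts]
print(part_sizes)  # [5, 5, 2]
```

Smaller batches mean more calls but shorter prompts, so `segments_per_call` is a trade-off between cost and how much context the model has to handle at once.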





In [21]:
from datagen.annotate import generate_annotations, generate_clues
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This is the information that will be extracted for each segment from the transcript and the data from the previous step.
# This is the most important part of the annotation, and getting good results requires a lot of experimentation.

import inspect

human_prompt = '''
You are given a JSON object that contains clues about segments of a video with timecodes.
!!!! For each segment provided in the JSON object, you need to answer the following questions:
1. Given the data found in the JSON object, what is the probability that this part contains footage of a person doing squats? [the answer can only be "High", "Medium" or "Low"]
2. Given the data found in the JSON object, and even if the answer to the previous question is "Low", does this person do squats right, wrong, or mixed? [the answer can only be "Right", "Wrong", or "Mixed"]
3. Given the data found in the JSON object, what exactly does this person do right and/or wrong regarding their squat technique? [the answer should be clear and focused on body parts]
4. If the answer to the previous question contains a description of wrong technique, explain how to fix these mistakes using your "own knowledge", as if you were a sports coach.
'''

# class QA(BaseModel):
#     '''
#     Question and answer about a video segment.
#     Only write questions and answers about the correctness of the exercises or in which ways the performance in the video was wrong.
#     '''
#     question: str = Field(description='Question about the exercise performance in the video')
#     answer: str = Field(description='Answer about the exercise performance from a trainer.')
#     quote: str = Field(description='A direct and explicit quote from transcript or on-screen-text. The answer must be directly inferred from this quote.')

# '''
#   This is an example of a good analysis:

#   —> GOOD EXAMPLE STARTS HERE:

# {
# "start_timestamp":"00:00:00.000",
# "end_timestamp":"00:00:10.000",
# "squats_probability":"High",
# "squats_technique":"Mixed",
# "squats_content": [{"wrong":"Knees Caving In: This can stress the knees and reduce effectiveness", "correction":"Focus on keeping knees aligned with your toes."},
#                     {"wrong":"Rounding the Back: This increases the risk of back injuries", "correction":"Keep your chest up and maintain a neutral spine throughout the movement."},
#                     {"wrong":"Heels Lifting Off the Ground: This shifts the weight forward, reducing stability", "correction":" Keep your weight on your heels and press through them as you rise."},
#                     {"right":"Chest and Shoulders: The chest is up, and the shoulders are back, maintaining an upright torso.", "correction":"No need."}]
#                     }
#     —> GOOD EXAMPLE ENDS HERE

#     This is an example of a bad analysis:

#     —> BAD EXAMPLE STARTS HERE

# {
# "start_timestamp":"00:00:00",
# "end_timestamp":"00:10:00",
# "squats_probability":"maybe",
# "squats_technique":"okay-ish",
# "squats_content": [{"wrong":"knees", "correction":"fix knees"},
#                    {"wrong":"back looks funny", "correction":"make back better"},
#                    {"wrong":"feet are doing something", "correction":"feet should be different"},
#                    {"right":"arms", "correction":"arms are fine i think"}]
# }
#     —> BAD EXAMPLE ENDS HERE

# '''

class SegmentFeedback(BaseModel):
    '''
    You are a fitness trainer giving feedback on what was right, wrong, and what could be improved.
    Talk as you would talk to a trainee, but avoid excessive language or irrelevant banter.

—> GOOD EXAMPLES:
    "wrong":"Knees caving in: This can stress the knees and reduce effectiveness"
    "correction":"Focus on keeping knees aligned with your toes."
    "wrong":"Rounding the back: This increases the risk of back injuries"
    "correction":"Keep your chest up and maintain a neutral spine throughout the movement."
    "wrong":"Heels are lifting off the ground: this shifts the weight forward, reducing stability"
    "correction":" Keep your weight on your heels and press through them as you rise."
    "right":"Chest and shoulders: The chest is up, and the shoulders are back, maintaining an upright torso."
    "correction":null
—> BAD EXAMPLES:
    "wrong":"knees"
    "correction":"fix knees"
    "wrong":"back looks funny"
    "correction":"make back better"
    "wrong":"feet are doing something"
    "correction":"feet should be different"
    "right":"arms"
    "correction":"arms are fine i think"
—> BAD EXAMPLES END HERE
    '''
    right: Optional[str] = Field(description='what was right in the performance')
    wrong: Optional[str] = Field(description='what was wrong in the performance')
    correction: Optional[str] = Field(description='how and in what ways the performance could be improved')

# The segment timestamps are taken from the provided information.
class SegmentAnnotation(BaseModel):
    '''
    This annotation is generated exclusively from the provided information about this specific segment.
    Don't pay attention to information about other segments.
    '''
    squats_probability: Optional[str] = Field(description='how high is the probability that the person is doing squats in the segment: low, medium, high, unknown(null)')
    squats_technique_correctness: Optional[bool] = Field(description='boolean correctness of the squat technique.')
    squats_feedback: SegmentFeedback = Field(description='what was right and wrong in the squat performance in the segment. When the technique is incorrect, provide instructions on how to correct it.')

# we will only take the segments where the "doing_squats" field is positive.
annotations = generate_annotations(
    # human_prompt=human_prompt,
    config=config,
    annotation_schema=SegmentAnnotation,
    # filter_by='doing_squats'
)

  0%|          | 0/4 [00:00<?, ?it/s]

gcNh17Ckjgg - started


 25%|██▌       | 1/4 [00:25<01:15, 25.31s/it]

gcNh17Ckjgg - done
YaXPRqUwItQ - started


 50%|█████     | 2/4 [00:35<00:32, 16.15s/it]

YaXPRqUwItQ - done
xqvCmoLULNY - started


 75%|███████▌  | 3/4 [00:37<00:09,  9.78s/it]

xqvCmoLULNY - done
KvRK5Owqzgw - started


100%|██████████| 4/4 [00:44<00:00, 11.16s/it]

KvRK5Owqzgw - done
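
Since `squats_probability` is a free-form string and every feedback field is optional, it can help to screen annotations before using them. The sketch below uses the field names from `SegmentAnnotation` and `SegmentFeedback` above, but the sample values and the `is_usable` rule are hypothetical.

```python
# Made-up annotations that follow the schema defined above.
sample_annotations = [
    {
        "squats_probability": "high",
        "squats_technique_correctness": False,
        "squats_feedback": {
            "right": None,
            "wrong": "Heels are lifting off the ground",
            "correction": "Keep your weight on your heels.",
        },
    },
    {
        "squats_probability": "Low",
        "squats_technique_correctness": None,
        "squats_feedback": {"right": None, "wrong": None, "correction": None},
    },
]

def is_usable(annotation):
    """Keep annotations with high squat probability and some actual feedback."""
    probability = (annotation.get("squats_probability") or "").strip().lower()
    feedback = annotation.get("squats_feedback") or {}
    has_feedback = any(feedback.get(key) for key in ("right", "wrong", "correction"))
    return probability == "high" and has_feedback

usable = [a for a in sample_annotations if is_usable(a)]
print(len(usable))  # 1
```

Normalizing the case before comparing matters because the model may answer "High", "high", or "HIGH" for the same field.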





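Now we get a list of annotations for each video. The fields inside `segment_annotation` depend on the `annotation_schema` you pass; the sample below uses a question-and-answer schema similar to the commented-out `QA` class above: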
```
[
    {
        "start_timestamp": "00:00:51.760",
        "end_timestamp": "00:01:01.520",
        "segment_annotation": {
            "correct": null,
            "incorrect_reasons": null,
            "qa": [
                {
                    "question": "Was there important advice about performing the exercise correctly?",
                    "answer": "Yes, the advice was to make sure the knees do not go forward of the toes.",
                    "quote": "making sure that your knees do not go forward of your toes"
                }
            ]
        }
    },
    ...
]
```

In [2]:
from datagen import aggregate_annotations

# saved to annotations.json
annotations = aggregate_annotations(config)
print('Total segments:', len(annotations))
annotations[0]

skipping gcNh17Ckjgg
Total segments: 22


{'start_timestamp': '00:00:20.479',
 'end_timestamp': '00:00:26.485',
 'segment_annotation': {'correct': None,
  'incorrect_reasons': None,
  'qa': [{'question': 'Was the exercise (squat) performed correctly?',
    'answer': 'Yes, the squat exercise was described correctly.',
    'quote': "let's learn how to properly perform a squat...cross your arms in front...shift your weight to the ball of your feet...bend your knees...push back up to the starting position."}]},
 'video_id': 'xqvCmoLULNY',
 'id': 'xqvCmoLULNY_0',
 'video_path': 'xqvCmoLULNY_0.mp4'}

## The last step is to cut video clips for annotated segments from original videos

In [3]:
from datagen import cut_videos
cut_videos(config=config)

100%|██████████| 22/22 [00:14<00:00,  1.55it/s]
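
How `cut_videos` cuts the clips is not shown here. One common way to cut a clip by timestamps is to call `ffmpeg`; the sketch below only builds such a command and does not run it. The helper name and the relative paths are hypothetical, while the timestamps are taken from the first aggregated annotation above.

```python
def build_cut_command(source, start, end, destination):
    """Build an ffmpeg command that extracts [start, end] from source."""
    return [
        "ffmpeg",
        "-y",            # overwrite the destination if it exists
        "-i", source,    # input video
        "-ss", start,    # clip start, e.g. '00:00:20.479'
        "-to", end,      # clip end, e.g. '00:00:26.485'
        destination,
    ]

command = build_cut_command(
    "videos/xqvCmoLULNY.mp4",
    "00:00:20.479",
    "00:00:26.485",
    "clips/xqvCmoLULNY_0.mp4",
)
print(" ".join(command))
# To actually run it: subprocess.run(command, check=True)
```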


As a result, we generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items, each with the fields:
    - `video_id`: 11-character YouTube video id (youtube.com/watch?v=<id>)
    - `start_timestamp`/`end_timestamp` of the clip relative to the YouTube video it's taken from
    - `video_path` of the clip relative to `<data_dir>/clips/`
    - `segment_annotation` that you can use for training
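
A minimal sketch of consuming this output: it reads `annotations.json`, resolves each clip path relative to `<data_dir>/clips/`, and rebuilds the source URL from `video_id`. The `load_clips` helper is hypothetical, and the demo runs on a temporary directory with one made-up item rather than on real SDK output.

```python
import json
import tempfile
from pathlib import Path

def load_clips(data_dir):
    """Return (clip_path, segment_annotation, source_url) for every item."""
    data_dir = Path(data_dir)
    with open(data_dir / "annotations.json") as f:
        items = json.load(f)
    clips = []
    for item in items:
        clip_path = data_dir / "clips" / item["video_path"]
        source_url = f"https://www.youtube.com/watch?v={item['video_id']}"
        clips.append((clip_path, item["segment_annotation"], source_url))
    return clips

# Demo on a throwaway directory with one made-up item.
with tempfile.TemporaryDirectory() as tmp:
    sample = [{
        "video_id": "xqvCmoLULNY",
        "start_timestamp": "00:00:20.479",
        "end_timestamp": "00:00:26.485",
        "video_path": "xqvCmoLULNY_0.mp4",
        "segment_annotation": {"note": "placeholder"},
    }]
    (Path(tmp) / "annotations.json").write_text(json.dumps(sample))
    clips = load_clips(tmp)

print(clips[0][0].name)  # xqvCmoLULNY_0.mp4
```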