# Getting Started with Data Generation SDK

We are going to generate a dataset of squat videos with instructions on how to perform them, so that we can train an AI personal trainer.

In [8]:
%load_ext autoreload
%autoreload 2

from datagen import DatagenConfig
# This config handles all the bookkeeping, so you need to pass it to every SDK call.
config = DatagenConfig.from_yaml('./config.yaml')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Get a list of search queries to search for videos

In [13]:
from datagen import get_queries
queries = get_queries(
    config=config,
    prompt='I want to find instructional videos about how to do squats.',
    num_queries=2
)
queries

['how to do squats instructions', 'squat exercise tutorial']

## Download video information for each query.

We'll get 2 videos for each query.<br>
The same video can be returned by several queries, so we might get fewer than `num_queries * videos_per_query` videos.<br>
If you want to get all YouTube videos for a query, omit the `videos_per_query` parameter.
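
This overlap is how two queries with two videos each can yield only three IDs. As a minimal sketch (this is not the SDK's implementation; `merge_video_ids` and the sample search results are hypothetical), results from several queries can be merged while keeping the first occurrence of each ID:

```python
from typing import Iterable


def merge_video_ids(results_per_query: Iterable[Iterable[str]]) -> list[str]:
    """Flatten per-query results, dropping IDs already seen (order preserved)."""
    seen: set[str] = set()
    merged: list[str] = []
    for ids in results_per_query:
        for video_id in ids:
            if video_id not in seen:
                seen.add(video_id)
                merged.append(video_id)
    return merged


# Hypothetical search results: both queries return 'xqvCmoLULNY'.
results = [
    ["xqvCmoLULNY", "gcNh17Ckjgg"],
    ["xqvCmoLULNY", "4KmY44Xsg2w"],
]
merged_ids = merge_video_ids(results)
print(merged_ids)
```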

You can limit the search to videos licensed under Creative Commons (as indicated by YouTube).<br>
Because the search libraries don't support this filter directly yet, we search for all videos and filter by license afterwards.<br>
Unfortunately, this usually leaves very few results, so use it with caution.
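
The search-then-filter approach can be sketched as follows. This is an illustration only: the `license` key and its wording are assumptions about the metadata available for each video (yt-dlp documents a `license` metadata field, but the SDK's internal representation may differ), and `is_creative_commons` is a hypothetical helper.

```python
def is_creative_commons(video_info: dict) -> bool:
    """True when the (assumed) 'license' field mentions Creative Commons."""
    license_name = video_info.get("license") or ""
    return "creative commons" in license_name.lower()


# Hypothetical metadata for three search results.
search_results = [
    {"id": "video_a", "license": "Creative Commons Attribution license (reuse allowed)"},
    {"id": "video_b", "license": None},
    {"id": "video_c"},
]
cc_ids = [info["id"] for info in search_results if is_creative_commons(info)]
print(cc_ids)
```

Because the filter runs after the search, the number of surviving videos can be much smaller than the number of videos requested.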

In [14]:
from datagen import get_video_ids
ids = get_video_ids(queries, config=config, videos_per_query=2, only_creative_commons=False)
ids

100%|██████████| 2/2 [00:01<00:00,  1.14it/s]


['xqvCmoLULNY', 'gcNh17Ckjgg', '4KmY44Xsg2w']

## Download videos and autogenerated subtitles

You can change the subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (refer to https://github.com/yt-dlp/yt-dlp).<br>
The SDK expects `.mp4` video files (for now), so don't change that.
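
For example, a `yt_dlp_opts` dictionary that requests English auto-generated subtitles in VTT format might look like the sketch below. The keys are standard yt-dlp `YoutubeDL` options; how the SDK merges them with its own defaults, and where exactly the dictionary is passed, are assumptions to check against the SDK.

```python
# Option names follow yt-dlp's YoutubeDL parameters.
yt_dlp_opts = {
    "writeautomaticsub": True,    # download auto-generated subtitles
    "subtitleslangs": ["en"],     # subtitle languages to fetch
    "subtitlesformat": "vtt",     # preferred subtitle format
    # Keep an mp4 container, since the SDK expects .mp4 files.
    "format": "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]",
}
print(yt_dlp_opts)
```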

In [15]:
from datagen import download_videos
download_videos(['gcNh17Ckjgg', 'KvRK5Owqzgw', 'xqvCmoLULNY', 'YaXPRqUwItQ'], config)

[youtube] Extracting URL: https://www.youtube.com/watch?v=YaXPRqUwItQ
[youtube] YaXPRqUwItQ: Downloading webpage
[youtube] YaXPRqUwItQ: Downloading ios player API JSON
[youtube] YaXPRqUwItQ: Downloading player d2e656ee


         n = U5NBYLtIN_DLGvfh ; player = https://www.youtube.com/s/player/d2e656ee/player_ias.vflset/en_US/base.js
         n = r2iA-HEoE7SU89EC ; player = https://www.youtube.com/s/player/d2e656ee/player_ias.vflset/en_US/base.js


[youtube] YaXPRqUwItQ: Downloading m3u8 information


## Detect segments from video and analyze them with GPT-4o

In [None]:
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384").cuda()
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

config.json:   0%|          | 0.00/576 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.51G [00:00<?, ?B/s]

AssertionError: Torch not compiled with CUDA enabled
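
The `AssertionError` above is raised when `.cuda()` is called on a PyTorch build without CUDA support. A minimal sketch of choosing the device at runtime instead of hard-coding it (`pick_device` is a hypothetical helper, not part of the SDK):

```python
def pick_device() -> str:
    """Return 'cuda' when a CUDA-enabled PyTorch build and a GPU are present, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The result can then be used as `model.to(device)` and passed as the `device` argument of `detect_segments_clip`. Expect CPU inference to be slow with a model of this size.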

In [None]:
from datagen import detect_segments_clip

from typing import Optional
from langchain.pydantic_v1 import BaseModel, Field

# This is the schema that we will extract from each detected segment.
# "doing_squats" will be used for filtering and "overlay_text" for annotation.

class SegmentInfo(BaseModel):
    '''Information about a segment'''
    doing_squats: bool = Field(description='Whether the person is doing squats. Only consider video of people, not renders or cartoons. If a person looks like they are preparing to do squats or standing between reps, consider them also doing squats if they are in a gym setting, wearing sportswear etc.')
    # overlay_text: str = Field(description='Overlay text that is superimposed over the image, if present.')

detect_segments_clip(
    # segment_info_schema=SegmentInfo,
    # video_ids=['KvRK5Owqzgw'],
    text_prompts='a person doing squats',
    model=model,
    processor=processor,
    fps_sampling=2,
    device='cuda',
    config=config
)
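
Conceptually, this kind of detection scores sampled frames against the text prompt and groups consecutive matching frames into segments. The sketch below illustrates that idea only; it is not the SDK's implementation, and `group_segments`, the threshold, and the scores are hypothetical.

```python
def group_segments(scores, fps_sampling, threshold=0.5):
    """Group consecutive sampled frames whose score >= threshold.

    scores[i] is the (hypothetical) match score of the frame sampled at
    i / fps_sampling seconds. Returns (start_sec, end_sec) tuples.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                      # segment opens at this frame
        elif score < threshold and start is not None:
            segments.append((start / fps_sampling, i / fps_sampling))
            start = None                   # segment closed
    if start is not None:                  # segment runs to the last frame
        segments.append((start / fps_sampling, len(scores) / fps_sampling))
    return segments


# Frames sampled at 2 fps; made-up scores for illustration.
frame_scores = [0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9]
segments = group_segments(frame_scores, fps_sampling=2)
print(segments)
```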

For each video we get a list of segments:
```
[
    ...
    {
        "start_timestamp": "00:00:31.198",
        "end_timestamp": "00:00:36.003",
        "fps": 29.97002997002997,
        "segment_info": {
            "doing_squats": true,
            "overlay_text": "HIP-WIDTH APART"
        },
        "video_id": "gcNh17Ckjgg"
    },
    ...
]
```
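
The timestamps use the `HH:MM:SS.mmm` format shown above. A small sketch for converting them to seconds and computing a segment's duration (`timestamp_to_seconds` is a hypothetical helper, not an SDK function):

```python
def timestamp_to_seconds(timestamp: str) -> float:
    """Convert 'HH:MM:SS.mmm' to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


segment = {
    "start_timestamp": "00:00:31.198",
    "end_timestamp": "00:00:36.003",
}
duration = timestamp_to_seconds(segment["end_timestamp"]) - timestamp_to_seconds(
    segment["start_timestamp"]
)
print(f"{duration:.3f} s")
```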

## Annotate the segments from the transcript + additional info

In [16]:
from datagen.annotate import generate_annotations, generate_clues
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

human_prompt = """User's instructions:
The initial video was a tutorial about how to perform squats. 
I need to restore what happened in specific *parts* of this video.

You'll find timecodes for the *parts* I'm interested in below. 

All *PARTS* CONTAIN A PERSON DOING SQUATS.

I NEED YOU TO DELIVER CLUES THAT WILL HELP ME RESTORE INFORMATION ABOUT HOW THIS PERSON PERFORMS SQUATS IN THIS SPECIFIC *PART*. 

!!!I need to restore data about HOW THIS PERSON PERFORMS SQUATS. 
What mistakes they make. What improvements they show. 
What they do correctly. What could be improved.!!!

Please, help me find relevant clues to reconstruct this information for each provided *part*.

Here is what I expect to have from you:
1. *Local clues* that could help me guess how a person in a *part* of the initial video performs squats  
2. *Global clues* that could help me guess how a person in a *part* of the initial video performs squats 
3. *Logical inferences* that could help me guess how a person in a *part* of the initial video performs squats 

!!!IT IS EXTREMELY IMPORTANT TO DELIVER ALL THREE THINGS!!!

CLUES: A *clue*, in the context of reconstructing narratives from damaged data, 
is a fragment of information extracted from a corrupted or incomplete source that provides 
insight into the original content. These fragments serve as starting points for inference 
and deduction, allowing researchers to hypothesize about the fuller context or meaning of 
the degraded material. The process of identifying and interpreting clues involves both objective analysis of the 
available data and subjective extrapolation based on domain knowledge, contextual understanding, 
and logical reasoning.

- LOCAL CLUES: THEY ARE LOCATED VERY CLOSE TO THE *PART* YOU ARE WORKING WITH REGARDING TIMESTAMPS
- GLOBAL CLUES: THEY ARE SCATTERED ACROSS THE ENTIRE TRANSCRIPT

LOGICAL INFERENCES: *Logical inference*, in the process of reconstructing narratives 
or information from damaged data, is the act of deriving plausible conclusions 
or filling in gaps based on available clues and contextual knowledge. This cognitive process 
involves applying deductive, inductive, or abductive reasoning to extrapolate beyond the explicit 
information provided by the damaged source. Logical inference relies on a combination of factual 
understanding, domain expertise, and analytical thinking to form connections between disparate 
pieces of information and generate coherent hypotheses about the missing or corrupted content. 
It often necessitates considering multiple possibilities, weighing probabilities, and making 
educated assumptions while maintaining awareness of potential biases or limitations in the 
reasoning process. The strength and validity of logical inferences can vary based on the quality  
and quantity of available clues, the complexity of the subject matter, and the inferrer's expertise,
making it both a powerful tool for information reconstruction and a process that requires careful 
scrutiny and validation.
"""


class LocalClue(BaseModel):
    '''
        Good local clues examples: [
      {
        "id": "LC1",
        "timestamp": "00:00:19",
        "quote": "exercises do them wrong and instead of",
        "clue": "This phrase introduces the concept of incorrect exercise form, setting the stage for a demonstration of improper technique."
      },
      {
        "id": "LC2",
        "timestamp": "00:00:21",
        "quote": "growing nice quads and glutes you'll",
        "clue": "Mentions the expected benefits of proper squats (muscle growth), implying that these benefits won't be achieved with incorrect form."
      },
      {
        "id": "LC3",
        "timestamp": "00:00:22",
        "quote": "feel aches and pains in your knees your",
        "clue": "Directly states negative consequences of improper form, strongly suggesting that this segment demonstrates incorrect technique."
      },
      {
        "id": "LC4",
        "timestamp": "00:00:24",
        "quote": "lower back and even your shoulders",
        "clue": "Continuation of LC3, emphasizing multiple areas of potential pain from improper form."
      },
      {
        "id": "LC5",
        "timestamp": "00:00:26",
        "quote": "let's see how to do it correctly",
        "clue": "This phrase suggests a transition is about to occur. The incorrect form has been shown, and correct form will follow."
      }
    ]
    '''
    id: str = Field(description='LC1,LC2...')
    timestamp: str = Field(description='the timestamp that is most probable for the clue')
    quote: str = Field(description='the quote from the transcript that was used to create this clue')
    clue: str = Field(description='the main clue data')
    
class GlobalClue(BaseModel):
    '''
    Good global clues examples: [
      {
        "id": "GC1",
        "timestamp": "00:01:15",
        "quote": "Before we dive into specific techniques, let's talk about safety.",
        "clue": "Introduces the theme of safety in squatting.",
        "relevance_to_segment": "This earlier emphasis on safety provides context for why proper depth is important and why it's being addressed in our segment. It connects to the fear of knee pain mentioned in LC3."
      },
      {
        "id": "GC2",
        "timestamp": "00:02:30",
        "quote": "Squatting is a fundamental movement pattern in everyday life.",
        "clue": "Emphasizes the importance of squats beyond just exercise.",
        "relevance_to_segment": "This broader context heightens the importance of learning proper squat depth as demonstrated in our segment. It suggests that the techniques shown have applications beyond just gym workouts."
      },
      {
        "id": "GC3",
        "timestamp": "00:05:20",
        "quote": "If you have existing knee issues, consult a physician before attempting deep squats.",
        "clue": "Provides a health disclaimer related to squat depth.",
        "relevance_to_segment": "While this comes after our segment, it's relevant because it addresses the concern about knee pain mentioned in LC3. It suggests that the demonstration in our segment is generally safe but acknowledges individual variations."
      },
      {
        "id": "GC4",
        "timestamp": "00:06:45",
        "quote": "Proper depth ensures full engagement of your quadriceps and glutes.",
        "clue": "Explains the benefit of correct squat depth.",
        "relevance_to_segment": "This later explanation provides justification for the depth guideline given in LC4. It helps viewers understand why the demonstrated technique is important."
      },
      {
        "id": "GC5",
        "timestamp": "00:00:30",
        "quote": "Today, we'll cover squat variations for beginners to advanced lifters.",
        "clue": "Outlines the scope of the entire video.",
        "relevance_to_segment": "This early statement suggests that our segment, focusing on proper depth, is part of a comprehensive guide. It implies that the demonstration might be adaptable for different skill levels."
      }
    ]
    '''
    id: str = Field(description='GC1,GC2...')
    timestamp: str = Field(description='the timestamp that is most probable for the clue')
    quote: str = Field(description='the quote from the transcript that was used to create this clue')
    clue: str = Field(description='the main clue data')
    relevance_to_segment: str = Field(description='why do you think this global clue is relevant to the *part* you are working with right now')

class AdditionalInformation(BaseModel):
    '''
    Good logical inference examples:
    [
      {
        "id": "LI1",
        "description": "Primary Demonstration of Heel Lift",
        "details": "Given that GC1-GC3 describe the 'most common mistake' as heels lifting off the ground, and this description immediately precedes our segment, it's highly probable that this is the primary error being demonstrated. This is further supported by the segment's focus on incorrect form (LC1-LC4)."
      },
      {
        "id": "LI2",
        "description": "Multiple Error Demonstration",
        "details": "While heel lift is likely the primary focus, the mention of multiple pain points (knees, lower back, shoulders in LC3-LC4) suggests that the demonstrator may be exhibiting several forms of incorrect technique simultaneously. This comprehensive 'what not to do' approach would be pedagogically effective."
      },
      {
        "id": "LI3",
        "description": "Possible Inclusion of 'Butt Wink'",
        "details": "Although 'butt wink' is mentioned after our segment (GC4-GC6), its connection to back pain (which is mentioned in LC4) raises the possibility that this error is also present in the demonstration. The instructor may be showing multiple errors early on, then breaking them down individually later."
      },
      {
        "id": "LI4",
        "description": "Segment Placement in Overall Video Structure",
        "details": "The segment's position (starting at 00:00:19) and the phrase 'let's see how to do it correctly' (LC5) at the end suggest this is an early, foundational part of the video. It likely serves to grab attention by showing common mistakes before transitioning to proper form instruction."
      },
      {
        "id": "LI5",
        "description": "Intentional Exaggeration of Errors",
        "details": "Given the educational nature of the video, it's plausible that the demonstrator is intentionally exaggerating the incorrect form. This would make the errors more obvious to viewers and enhance the contrast with correct form shown later."
      }
    ]
    '''
    id: str = Field(description='LI1,LI2,...')
    description: str = Field(description='A concise form of the logical inference')
    details: str = Field(description='A verbose explanation of what insight about what happens in this *part* should be made based on delivered clues')

class SegmentAnnotation(BaseModel):
    local_clues: list[LocalClue] = Field(description='''Local clues are positioned very close to the *part* of the video in 
                                              terms of timestamps.''')
    global_clues: list[GlobalClue] = Field(description='''Global clues are scattered across the entire transcript. Be very careful with
                                              them, as it's very easy to accidentally assume a wrong global clue because of limited attention or IQ.
                                              ''')
    logical_inferences: list[AdditionalInformation] = Field(description='''What guess about how a person performs squats in this *part* can we make based on clues''')

# we will only take the segments where the "doing_squats" field is positive.
clues = generate_clues(
    config=config,
    annotation_schema=SegmentAnnotation,
    human_prompt=human_prompt,
    segments_per_call=5,
    raise_on_error=True
)

  0%|          | 0/4 [00:00<?, ?it/s]

xqvCmoLULNY - started
xqvCmoLULNY part 0 - started
xqvCmoLULNY part 1 - started


 25%|██▌       | 1/4 [00:20<01:02, 20.81s/it]

xqvCmoLULNY - done
gcNh17Ckjgg - started
gcNh17Ckjgg part 0 - started
gcNh17Ckjgg part 1 - started
gcNh17Ckjgg part 2 - started
gcNh17Ckjgg part 3 - started
gcNh17Ckjgg part 4 - started
gcNh17Ckjgg part 5 - started
gcNh17Ckjgg part 6 - started
gcNh17Ckjgg part 7 - started
gcNh17Ckjgg part 8 - started
gcNh17Ckjgg part 9 - started
gcNh17Ckjgg part 10 - started
gcNh17Ckjgg part 11 - started
gcNh17Ckjgg part 12 - started
gcNh17Ckjgg part 13 - started
gcNh17Ckjgg part 14 - started
gcNh17Ckjgg part 15 - started
gcNh17Ckjgg part 16 - started
gcNh17Ckjgg part 17 - started
gcNh17Ckjgg part 18 - started
gcNh17Ckjgg part 19 - started


 50%|█████     | 2/4 [08:50<10:16, 308.14s/it]

gcNh17Ckjgg - done
EbOPpWi4L8s - started
EbOPpWi4L8s part 0 - started
EbOPpWi4L8s part 1 - started
EbOPpWi4L8s part 2 - started


 75%|███████▌  | 3/4 [09:41<03:10, 190.93s/it]

EbOPpWi4L8s - done
HFnSsLIB7a4 - started
HFnSsLIB7a4 part 0 - started
HFnSsLIB7a4 part 1 - started
HFnSsLIB7a4 part 2 - started
HFnSsLIB7a4 part 3 - started
HFnSsLIB7a4 part 4 - started
HFnSsLIB7a4 part 5 - started
HFnSsLIB7a4 part 6 - started
HFnSsLIB7a4 part 7 - started
HFnSsLIB7a4 part 8 - started


100%|██████████| 4/4 [11:30<00:00, 172.60s/it]

HFnSsLIB7a4 - done





In [18]:
from datagen.annotate import generate_annotations, generate_clues
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This is the information that will be extracted for each segment from the transcript and the data from the previous step.
# This is the most important part of the annotation, and getting good results requires a lot of experimenting.

import inspect

human_prompt = '''
You are given a JSON object that contains clues about segments of a video with timecodes.
!!!! For each segment provided in the JSON object you need to answer the following questions:
1. Given the data found in the JSON object, what is the probability that this part contains footage of a person doing squats? [the answer can only be "High", "Medium" or "Low"]
2. Given the data found in the JSON object, and even if the answer to the previous question is "Low", does this person do squats right, wrong, or mixed? [the answer can only be "Right", "Wrong" or "Mixed"]
3. Given the data found in the JSON object, what exactly does this person do right and/or wrong regarding their squat technique? [the answer should be clear and focused on body parts]
4. If the answer to the previous question contains a description of wrong technique, explain how to fix these mistakes using your "own knowledge" like you are a sports coach.
'''

class SegmentFeedback(BaseModel):
    '''
—> GOOD EXAMPLES:
    "wrong":"Knees caving in: This can stress the knees and reduce effectiveness"
    "correction":"Focus on keeping knees aligned with your toes."
    "wrong":"Rounding the back: This increases the risk of back injuries"
    "correction":"Keep your chest up and maintain a neutral spine throughout the movement."
    "wrong":"Heels are lifting off the ground: this shifts the weight forward, reducing stability"
    "correction":" Keep your weight on your heels and press through them as you rise."
    "right":"Chest and shoulders: The chest is up, and the shoulders are back, maintaining an upright torso."
    "correction":null
—> BAD EXAMPLES:
    "wrong":"knees"
    "correction":"fix knees"
    "wrong":"back looks funny"
    "correction":"make back better"
    "wrong":"feet are doing something"
    "correction":"feet should be different"
    "right":"arms"
    "correction":"arms are fine i think"
—> BAD EXAMPLES END HERE
    '''
    right: Optional[str] = Field(description='what was right in the performance')
    wrong: Optional[str] = Field(description='what was wrong in the performance')
    correction: Optional[str] = Field(description='how and in what ways the performance could be improved')

# The segment timestamps are taken from the provided information.
class SegmentAnnotation(BaseModel):
    '''
Here is a JSON object that contains data about parts with timecodes of a video file that's called "How to do squats: rights and wrongs".
                !!!! Answer the following questions:
                1. Given the data found in the JSON object, what is the probability that this part contains footage of a person doing squats? [the answer can only be "High", "Medium" or "Low"]
                2. Given the data found in the JSON object, and even if the answer to the previous question is "Low", does this person do squats right, wrong, or mixed? [the answer can only be "Right", "Wrong" or "Mixed"]
                3. Given the data found in the JSON object, what exactly does this person do right and/or wrong regarding their squat technique? [the answer should be clear and focused on body parts]
                4. If the answer to the previous question contains a description of wrong technique, explain how to fix these mistakes using your "own knowledge" like you are a sports coach.

    '''
    squats_probability: Optional[str] = Field(description='how high is the probability that the person is doing squats in the segment: low, medium, high, unknown(null)')
    squats_technique_correctness: Optional[bool] = Field(description='boolean correctness of the squat technique.')
    squats_feedback: SegmentFeedback = Field(description='what was right and wrong in the squat performance in the segment. When the technique is incorrect, provide instructions on how to correct it.')

# To keep only the segments where the "doing_squats" field is positive, uncomment `filter_by` below.
annotations = generate_annotations(
    # human_prompt=human_prompt,
    config=config,
    annotation_schema=SegmentAnnotation,
    # filter_by='doing_squats'
)

  0%|          | 0/4 [00:00<?, ?it/s]

HFnSsLIB7a4 - started


 25%|██▌       | 1/4 [00:41<02:03, 41.14s/it]

16 validation errors for VideoAnnotation
segments -> 0 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 1 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 2 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 3 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 4 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 5 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 6 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 7 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 8 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 9 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
segments -> 10 -> segment_annotation
  value is not a valid dict (type=type_error.dict)
s

 50%|█████     | 2/4 [00:57<00:53, 26.68s/it]

gcNh17Ckjgg - done
EbOPpWi4L8s - started


 75%|███████▌  | 3/4 [01:07<00:19, 19.18s/it]

EbOPpWi4L8s - done
xqvCmoLULNY - started


100%|██████████| 4/4 [01:17<00:00, 19.42s/it]

xqvCmoLULNY - done





Now we get a list of annotations for each video:
```
[
    {
        "start_timestamp": "00:00:51.760",
        "end_timestamp": "00:01:01.520",
        "segment_annotation": {
            "correct": null,
            "incorrect_reasons": null,
            "qa": [
                {
                    "question": "Was there important advice about performing the exercise correctly?",
                    "answer": "Yes, the advice was to make sure the knees do not go forward of the toes.",
                    "quote": "making sure that your knees do not go forward of your toes"
                }
            ]
        }
    },
    ...
]
```

In [None]:
from datagen import aggregate_annotations

# saved to annotations.json
annotations = aggregate_annotations(config)
print('Total segments:', len(annotations))
annotations[0]

skipping gcNh17Ckjgg
Total segments: 22


{'start_timestamp': '00:00:20.479',
 'end_timestamp': '00:00:26.485',
 'segment_annotation': {'correct': None,
  'incorrect_reasons': None,
  'qa': [{'question': 'Was the exercise (squat) performed correctly?',
    'answer': 'Yes, the squat exercise was described correctly.',
    'quote': "let's learn how to properly perform a squat...cross your arms in front...shift your weight to the ball of your feet...bend your knees...push back up to the starting position."}]},
 'video_id': 'xqvCmoLULNY',
 'id': 'xqvCmoLULNY_0',
 'video_path': 'xqvCmoLULNY_0.mp4'}

## The last step is to cut video clips for annotated segments from original videos

In [None]:
from datagen import cut_videos
cut_videos(config=config)

100%|██████████| 22/22 [00:14<00:00,  1.55it/s]


As a result, we generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items, each with the fields:
    - `video_id`: 11-character YouTube video ID (youtube.com/watch?v=<id>)
    - `start_timestamp`/`end_timestamp` of the clip relative to the YouTube video it was taken from
    - `video_path` of the clip relative to `<data_dir>/clips/`
    - `segment_annotation` that you can use for training
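
A minimal sketch of reading these outputs back for training. It assumes only the layout described above; `load_clip_annotations` is a hypothetical helper, and the demo writes a one-item `annotations.json` to a temporary directory so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_clip_annotations(data_dir):
    """Return (clip_path, segment_annotation) pairs from <data_dir>/annotations.json."""
    data_dir = Path(data_dir)
    with open(data_dir / "annotations.json", encoding="utf-8") as f:
        items = json.load(f)
    return [
        (data_dir / "clips" / item["video_path"], item["segment_annotation"])
        for item in items
    ]


# Demo with a stand-in data directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = [{
        "video_id": "xqvCmoLULNY",
        "start_timestamp": "00:00:20.479",
        "end_timestamp": "00:00:26.485",
        "video_path": "xqvCmoLULNY_0.mp4",
        "segment_annotation": {"correct": None},
    }]
    (Path(tmp) / "annotations.json").write_text(json.dumps(sample), encoding="utf-8")
    pairs = load_clip_annotations(tmp)
    print(pairs[0][0].name, pairs[0][1])
```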