# Data Generation SDK: Example Usage

We are going to generate a dataset of squat videos with instructions to train an LLM.

In [1]:
import datagen
from datagen.core.config import DatagenConfig
config = DatagenConfig.from_yaml('./config.yaml')

## Get a list of search queries to use

In [3]:
from datagen.search import get_queries, get_video_info
queries = get_queries(config=config, prompt='I want to find instructional videos about how to do squats.', num_queries=5)
print(len(queries))
queries

5


['how to do squats',
 'squat exercises for beginners',
 'proper squat form',
 'squat variations',
 'how to squat correctly']

## Download video information for each query

The metadata returned at this stage contains many fields that can be used to filter the videos if necessary, but only the video ids are used later.
Videos are deduplicated across queries, so the same video never needs to be downloaded more than once.
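
To make the deduplication step concrete, here is a minimal sketch of how search hits from several queries could be merged into one record per video id while remembering which queries found each video. This is an illustration in plain Python, not the SDK's actual implementation; the function name `deduplicate` and the `(query, video_id)` input shape are assumptions made for this example. The ids and queries are taken from the search log in this notebook.

```python
def deduplicate(hits):
    """Merge (query, video_id) pairs into one record per video id.

    Hypothetical helper for illustration; the SDK may do this differently.
    """
    unique = {}
    for query, video_id in hits:
        # Create the record the first time a video id is seen.
        record = unique.setdefault(video_id, {"id": video_id, "queries": []})
        # Remember every distinct query that returned this video.
        if query not in record["queries"]:
            record["queries"].append(query)
    return unique


hits = [
    ("proper squat form", "EzvnMZuxGWw"),
    ("squat variations", "EzvnMZuxGWw"),
    ("how to do squats", "AIZ8q1qruKw"),
]
videos = deduplicate(hits)
print(len(videos))
print(videos["EzvnMZuxGWw"]["queries"])
```

Three hits collapse into two unique videos, and the video returned by two queries keeps both of them in its `queries` list.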

In [4]:
df = get_video_info(queries, videos_per_query=10)
df.head()

  0%|          | 0/5 [00:00<?, ?it/s]

[youtube:search] Extracting URL: ytsearch10:how to do squats
[download] Downloading playlist: how to do squats
[youtube:search] query "how to do squats": Downloading web client config
[youtube:search] query "how to do squats" page 1: Downloading API JSON
[youtube:search] Playlist how to do squats: Downloading 10 items of 10
[download] Downloading item 1 of 10
[youtube] Extracting URL: https://www.youtube.com/watch?v=IB_icWRzi4E
[youtube] IB_icWRzi4E: Downloading webpage
[youtube] IB_icWRzi4E: Downloading ios player API JSON
[youtube] IB_icWRzi4E: Downloading android player API JSON
[youtube] IB_icWRzi4E: Downloading m3u8 information




[download] Downloading item 2 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/gslEzVggur8
[youtube] gslEzVggur8: Downloading webpage
[youtube] gslEzVggur8: Downloading ios player API JSON
[youtube] gslEzVggur8: Downloading android player API JSON
[youtube] gslEzVggur8: Downloading m3u8 information




[download] Downloading item 3 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/PPmvh7gBTi0
[youtube] PPmvh7gBTi0: Downloading webpage
[youtube] PPmvh7gBTi0: Downloading ios player API JSON
[youtube] PPmvh7gBTi0: Downloading android player API JSON
[youtube] PPmvh7gBTi0: Downloading m3u8 information




[download] Downloading item 4 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/HgDZlNQrifY
[youtube] HgDZlNQrifY: Downloading webpage
[youtube] HgDZlNQrifY: Downloading ios player API JSON
[youtube] HgDZlNQrifY: Downloading android player API JSON
[youtube] HgDZlNQrifY: Downloading m3u8 information




[download] Downloading item 5 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/AIZ8q1qruKw
[youtube] AIZ8q1qruKw: Downloading webpage
[youtube] AIZ8q1qruKw: Downloading ios player API JSON
[youtube] AIZ8q1qruKw: Downloading android player API JSON
[youtube] AIZ8q1qruKw: Downloading m3u8 information




[download] Downloading item 6 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/cRxg-PUAT6I
[youtube] cRxg-PUAT6I: Downloading webpage
[youtube] cRxg-PUAT6I: Downloading ios player API JSON
[youtube] cRxg-PUAT6I: Downloading android player API JSON
[youtube] cRxg-PUAT6I: Downloading m3u8 information




[download] Downloading item 7 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/SLOkdLLWj8A
[youtube] SLOkdLLWj8A: Downloading webpage
[youtube] SLOkdLLWj8A: Downloading ios player API JSON
[youtube] SLOkdLLWj8A: Downloading android player API JSON
[youtube] SLOkdLLWj8A: Downloading m3u8 information




[download] Downloading item 8 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/TH6jSCGnowI
[youtube] TH6jSCGnowI: Downloading webpage
[youtube] TH6jSCGnowI: Downloading ios player API JSON
[youtube] TH6jSCGnowI: Downloading android player API JSON
[youtube] TH6jSCGnowI: Downloading m3u8 information




[download] Downloading item 9 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/iZTxa8NJH2g
[youtube] iZTxa8NJH2g: Downloading webpage
[youtube] iZTxa8NJH2g: Downloading ios player API JSON
[youtube] iZTxa8NJH2g: Downloading android player API JSON
[youtube] iZTxa8NJH2g: Downloading m3u8 information




[download] Downloading item 10 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/MLoZuAkIyZI
[youtube] MLoZuAkIyZI: Downloading webpage
[youtube] MLoZuAkIyZI: Downloading ios player API JSON
[youtube] MLoZuAkIyZI: Downloading android player API JSON
[youtube] MLoZuAkIyZI: Downloading m3u8 information




[download] Finished downloading playlist: how to do squats


 20%|██        | 1/5 [00:30<02:00, 30.01s/it]

[youtube:search] Extracting URL: ytsearch10:squat exercises for beginners
[download] Downloading playlist: squat exercises for beginners
[youtube:search] query "squat exercises for beginners": Downloading web client config
[youtube:search] query "squat exercises for beginners" page 1: Downloading API JSON
[youtube:search] Playlist squat exercises for beginners: Downloading 10 items of 10
[download] Downloading item 1 of 10
[youtube] Extracting URL: https://www.youtube.com/watch?v=4KmY44Xsg2w
[youtube] 4KmY44Xsg2w: Downloading webpage
[youtube] 4KmY44Xsg2w: Downloading ios player API JSON
[youtube] 4KmY44Xsg2w: Downloading android player API JSON
[youtube] 4KmY44Xsg2w: Downloading m3u8 information




[download] Downloading item 2 of 10
[youtube] Extracting URL: https://www.youtube.com/watch?v=EbOPpWi4L8s
[youtube] EbOPpWi4L8s: Downloading webpage
[youtube] EbOPpWi4L8s: Downloading ios player API JSON
[youtube] EbOPpWi4L8s: Downloading android player API JSON
[youtube] EbOPpWi4L8s: Downloading m3u8 information




[download] Downloading item 3 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/gslEzVggur8
[youtube] gslEzVggur8: Downloading webpage
[youtube] gslEzVggur8: Downloading ios player API JSON
[youtube] gslEzVggur8: Downloading android player API JSON
[youtube] gslEzVggur8: Downloading m3u8 information




[download] Downloading item 4 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/cRxg-PUAT6I
[youtube] cRxg-PUAT6I: Downloading webpage
[youtube] cRxg-PUAT6I: Downloading ios player API JSON
[youtube] cRxg-PUAT6I: Downloading android player API JSON
[youtube] cRxg-PUAT6I: Downloading m3u8 information




[download] Downloading item 5 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/HgDZlNQrifY
[youtube] HgDZlNQrifY: Downloading webpage
[youtube] HgDZlNQrifY: Downloading ios player API JSON
[youtube] HgDZlNQrifY: Downloading android player API JSON
[youtube] HgDZlNQrifY: Downloading m3u8 information




[download] Downloading item 6 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/iZTxa8NJH2g
[youtube] iZTxa8NJH2g: Downloading webpage
[youtube] iZTxa8NJH2g: Downloading ios player API JSON
[youtube] iZTxa8NJH2g: Downloading android player API JSON
[youtube] iZTxa8NJH2g: Downloading m3u8 information




[download] Downloading item 7 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/w8ZhgecdIAM
[youtube] w8ZhgecdIAM: Downloading webpage
[youtube] w8ZhgecdIAM: Downloading ios player API JSON
[youtube] w8ZhgecdIAM: Downloading android player API JSON
[youtube] w8ZhgecdIAM: Downloading m3u8 information




[download] Downloading item 8 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/MLoZuAkIyZI
[youtube] MLoZuAkIyZI: Downloading webpage
[youtube] MLoZuAkIyZI: Downloading ios player API JSON
[youtube] MLoZuAkIyZI: Downloading android player API JSON
[youtube] MLoZuAkIyZI: Downloading m3u8 information




[download] Downloading item 9 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/OTyb4YUDYYY
[youtube] OTyb4YUDYYY: Downloading webpage
[youtube] OTyb4YUDYYY: Downloading ios player API JSON
[youtube] OTyb4YUDYYY: Downloading android player API JSON
[youtube] OTyb4YUDYYY: Downloading m3u8 information




[download] Downloading item 10 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/T6id8FuUcao
[youtube] T6id8FuUcao: Downloading webpage
[youtube] T6id8FuUcao: Downloading ios player API JSON
[youtube] T6id8FuUcao: Downloading android player API JSON
[youtube] T6id8FuUcao: Downloading m3u8 information




[download] Finished downloading playlist: squat exercises for beginners


 40%|████      | 2/5 [00:58<01:28, 29.34s/it]

[youtube:search] Extracting URL: ytsearch10:proper squat form
[download] Downloading playlist: proper squat form
[youtube:search] query "proper squat form": Downloading web client config
[youtube:search] query "proper squat form" page 1: Downloading API JSON
[youtube:search] Playlist proper squat form: Downloading 10 items of 10
[download] Downloading item 1 of 10
[youtube] Extracting URL: https://www.youtube.com/watch?v=byxWus7BwfQ
[youtube] byxWus7BwfQ: Downloading webpage
[youtube] byxWus7BwfQ: Downloading ios player API JSON
[youtube] byxWus7BwfQ: Downloading android player API JSON
[youtube] byxWus7BwfQ: Downloading m3u8 information




[download] Downloading item 2 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/gslEzVggur8
[youtube] gslEzVggur8: Downloading webpage
[youtube] gslEzVggur8: Downloading ios player API JSON
[youtube] gslEzVggur8: Downloading android player API JSON
[youtube] gslEzVggur8: Downloading m3u8 information




[download] Downloading item 3 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/PPmvh7gBTi0
[youtube] PPmvh7gBTi0: Downloading webpage
[youtube] PPmvh7gBTi0: Downloading ios player API JSON
[youtube] PPmvh7gBTi0: Downloading android player API JSON
[youtube] PPmvh7gBTi0: Downloading m3u8 information




[download] Downloading item 4 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/iZTxa8NJH2g
[youtube] iZTxa8NJH2g: Downloading webpage
[youtube] iZTxa8NJH2g: Downloading ios player API JSON
[youtube] iZTxa8NJH2g: Downloading android player API JSON
[youtube] iZTxa8NJH2g: Downloading m3u8 information




[download] Downloading item 5 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/HgDZlNQrifY
[youtube] HgDZlNQrifY: Downloading webpage
[youtube] HgDZlNQrifY: Downloading ios player API JSON
[youtube] HgDZlNQrifY: Downloading android player API JSON
[youtube] HgDZlNQrifY: Downloading m3u8 information




[download] Downloading item 6 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/EzvnMZuxGWw
[youtube] EzvnMZuxGWw: Downloading webpage
[youtube] EzvnMZuxGWw: Downloading ios player API JSON
[youtube] EzvnMZuxGWw: Downloading android player API JSON
[youtube] EzvnMZuxGWw: Downloading m3u8 information




[download] Downloading item 7 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/MM9ObaAPcv4
[youtube] MM9ObaAPcv4: Downloading webpage
[youtube] MM9ObaAPcv4: Downloading ios player API JSON
[youtube] MM9ObaAPcv4: Downloading android player API JSON
[youtube] MM9ObaAPcv4: Downloading m3u8 information




[download] Downloading item 8 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/cRxg-PUAT6I
[youtube] cRxg-PUAT6I: Downloading webpage
[youtube] cRxg-PUAT6I: Downloading ios player API JSON
[youtube] cRxg-PUAT6I: Downloading android player API JSON
[youtube] cRxg-PUAT6I: Downloading m3u8 information




[download] Downloading item 9 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/MLoZuAkIyZI
[youtube] MLoZuAkIyZI: Downloading webpage
[youtube] MLoZuAkIyZI: Downloading ios player API JSON
[youtube] MLoZuAkIyZI: Downloading android player API JSON
[youtube] MLoZuAkIyZI: Downloading m3u8 information




[download] Downloading item 10 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/TH6jSCGnowI
[youtube] TH6jSCGnowI: Downloading webpage
[youtube] TH6jSCGnowI: Downloading ios player API JSON
[youtube] TH6jSCGnowI: Downloading android player API JSON
[youtube] TH6jSCGnowI: Downloading m3u8 information




[download] Finished downloading playlist: proper squat form


 60%|██████    | 3/5 [01:29<01:00, 30.08s/it]

[youtube:search] Extracting URL: ytsearch10:squat variations
[download] Downloading playlist: squat variations
[youtube:search] query "squat variations": Downloading web client config
[youtube:search] query "squat variations" page 1: Downloading API JSON
[youtube:search] Playlist squat variations: Downloading 10 items of 10
[download] Downloading item 1 of 10
[youtube] Extracting URL: https://www.youtube.com/watch?v=C73Y3EsJWIk
[youtube] C73Y3EsJWIk: Downloading webpage
[youtube] C73Y3EsJWIk: Downloading ios player API JSON
[youtube] C73Y3EsJWIk: Downloading android player API JSON
[youtube] C73Y3EsJWIk: Downloading m3u8 information




[download] Downloading item 2 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/LF4zb2SYWjQ
[youtube] LF4zb2SYWjQ: Downloading webpage
[youtube] LF4zb2SYWjQ: Downloading ios player API JSON
[youtube] LF4zb2SYWjQ: Downloading android player API JSON
[youtube] LF4zb2SYWjQ: Downloading m3u8 information




[download] Downloading item 3 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/LFkinX12jtU
[youtube] LFkinX12jtU: Downloading webpage
[youtube] LFkinX12jtU: Downloading ios player API JSON
[youtube] LFkinX12jtU: Downloading android player API JSON
[youtube] LFkinX12jtU: Downloading m3u8 information




[download] Downloading item 4 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/PPmvh7gBTi0
[youtube] PPmvh7gBTi0: Downloading webpage
[youtube] PPmvh7gBTi0: Downloading ios player API JSON
[youtube] PPmvh7gBTi0: Downloading android player API JSON
[youtube] PPmvh7gBTi0: Downloading m3u8 information




[download] Downloading item 5 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/MLoZuAkIyZI
[youtube] MLoZuAkIyZI: Downloading webpage
[youtube] MLoZuAkIyZI: Downloading ios player API JSON
[youtube] MLoZuAkIyZI: Downloading android player API JSON
[youtube] MLoZuAkIyZI: Downloading m3u8 information




[download] Downloading item 6 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/a3aw-5vDM2E
[youtube] a3aw-5vDM2E: Downloading webpage
[youtube] a3aw-5vDM2E: Downloading ios player API JSON
[youtube] a3aw-5vDM2E: Downloading android player API JSON
[youtube] a3aw-5vDM2E: Downloading m3u8 information




[download] Downloading item 7 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/PJj5shV4uYo
[youtube] PJj5shV4uYo: Downloading webpage
[youtube] PJj5shV4uYo: Downloading ios player API JSON
[youtube] PJj5shV4uYo: Downloading android player API JSON
[youtube] PJj5shV4uYo: Downloading m3u8 information




[download] Downloading item 8 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/xawAf5fXD2c
[youtube] xawAf5fXD2c: Downloading webpage
[youtube] xawAf5fXD2c: Downloading ios player API JSON
[youtube] xawAf5fXD2c: Downloading android player API JSON
[youtube] xawAf5fXD2c: Downloading m3u8 information




[download] Downloading item 9 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/jhb_nnV29EU
[youtube] jhb_nnV29EU: Downloading webpage
[youtube] jhb_nnV29EU: Downloading ios player API JSON
[youtube] jhb_nnV29EU: Downloading android player API JSON
[youtube] jhb_nnV29EU: Downloading m3u8 information




[download] Downloading item 10 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/EzvnMZuxGWw
[youtube] EzvnMZuxGWw: Downloading webpage
[youtube] EzvnMZuxGWw: Downloading ios player API JSON
[youtube] EzvnMZuxGWw: Downloading android player API JSON
[youtube] EzvnMZuxGWw: Downloading m3u8 information




[download] Finished downloading playlist: squat variations


 80%|████████  | 4/5 [01:59<00:30, 30.07s/it]

[youtube:search] Extracting URL: ytsearch10:how to squat correctly
[download] Downloading playlist: how to squat correctly
[youtube:search] query "how to squat correctly": Downloading web client config
[youtube:search] query "how to squat correctly" page 1: Downloading API JSON
[youtube:search] Playlist how to squat correctly: Downloading 10 items of 10
[download] Downloading item 1 of 10
[youtube] Extracting URL: https://www.youtube.com/watch?v=byxWus7BwfQ
[youtube] byxWus7BwfQ: Downloading webpage
[youtube] byxWus7BwfQ: Downloading ios player API JSON
[youtube] byxWus7BwfQ: Downloading android player API JSON
[youtube] byxWus7BwfQ: Downloading m3u8 information




[download] Downloading item 2 of 10
[youtube] Extracting URL: https://www.youtube.com/watch?v=my0tLDaWyDU
[youtube] my0tLDaWyDU: Downloading webpage
[youtube] my0tLDaWyDU: Downloading ios player API JSON
[youtube] my0tLDaWyDU: Downloading android player API JSON
[youtube] my0tLDaWyDU: Downloading m3u8 information




[download] Downloading item 3 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/gslEzVggur8
[youtube] gslEzVggur8: Downloading webpage
[youtube] gslEzVggur8: Downloading ios player API JSON
[youtube] gslEzVggur8: Downloading android player API JSON
[youtube] gslEzVggur8: Downloading m3u8 information




[download] Downloading item 4 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/PPmvh7gBTi0
[youtube] PPmvh7gBTi0: Downloading webpage
[youtube] PPmvh7gBTi0: Downloading ios player API JSON
[youtube] PPmvh7gBTi0: Downloading android player API JSON
[youtube] PPmvh7gBTi0: Downloading m3u8 information




[download] Downloading item 5 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/HgDZlNQrifY
[youtube] HgDZlNQrifY: Downloading webpage
[youtube] HgDZlNQrifY: Downloading ios player API JSON
[youtube] HgDZlNQrifY: Downloading android player API JSON
[youtube] HgDZlNQrifY: Downloading m3u8 information




[download] Downloading item 6 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/iZTxa8NJH2g
[youtube] iZTxa8NJH2g: Downloading webpage
[youtube] iZTxa8NJH2g: Downloading ios player API JSON
[youtube] iZTxa8NJH2g: Downloading android player API JSON
[youtube] iZTxa8NJH2g: Downloading m3u8 information




[download] Downloading item 7 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/rhbIFJj4UYc
[youtube] rhbIFJj4UYc: Downloading webpage
[youtube] rhbIFJj4UYc: Downloading ios player API JSON
[youtube] rhbIFJj4UYc: Downloading android player API JSON
[youtube] rhbIFJj4UYc: Downloading m3u8 information




[download] Downloading item 8 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/SLOkdLLWj8A
[youtube] SLOkdLLWj8A: Downloading webpage
[youtube] SLOkdLLWj8A: Downloading ios player API JSON
[youtube] SLOkdLLWj8A: Downloading android player API JSON
[youtube] SLOkdLLWj8A: Downloading m3u8 information




[download] Downloading item 9 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/cRxg-PUAT6I
[youtube] cRxg-PUAT6I: Downloading webpage
[youtube] cRxg-PUAT6I: Downloading ios player API JSON
[youtube] cRxg-PUAT6I: Downloading android player API JSON
[youtube] cRxg-PUAT6I: Downloading m3u8 information




[download] Downloading item 10 of 10
[youtube] Extracting URL: https://www.youtube.com/shorts/HZilSL4ZNvQ
[youtube] HZilSL4ZNvQ: Downloading webpage
[youtube] HZilSL4ZNvQ: Downloading ios player API JSON
[youtube] HZilSL4ZNvQ: Downloading android player API JSON
[youtube] HZilSL4ZNvQ: Downloading m3u8 information




[download] Finished downloading playlist: how to squat correctly


100%|██████████| 5/5 [02:31<00:00, 30.21s/it]


id (index),id,title,formats,thumbnails,thumbnail,description,channel_id,channel_url,duration,view_count,...,vbr,stretched_ratio,aspect_ratio,acodec,abr,asr,audio_channels,query,location,queries
4KmY44Xsg2w,4KmY44Xsg2w,The Basic Squat - Balance Exercise - CORE Chir...,"[{'format_id': 'sb2', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/4KmY44Xsg2w/3...,https://i.ytimg.com/vi_webp/4KmY44Xsg2w/maxres...,Dr. Natalie Cordova demonstrates how to do a b...,UCW6EenBHb_KF-eaFRE3gXnA,https://www.youtube.com/channel/UCW6EenBHb_KF-...,173,223954,...,707.089,,1.78,opus,93.518,48000,2,squat exercises for beginners,CORE CHIROPRACTIC,[[squat exercises for beginners]]
AIZ8q1qruKw,AIZ8q1qruKw,How to Perform a PERFECT Squat,"[{'format_id': 'sb1', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/AIZ8q1qruKw/o...,https://i.ytimg.com/vi/AIZ8q1qruKw/sd2.jpg?sqp...,Get my book on fixing injury here: \nhttps://w...,UCyPYQTT20IgzVw92LDvtClw,https://www.youtube.com/channel/UCyPYQTT20IgzV...,59,1274974,...,649.69,,0.56,opus,108.711,48000,2,how to do squats,,[[how to do squats]]
C73Y3EsJWIk,C73Y3EsJWIk,Top 10 BEST SQUATS Variations,"[{'format_id': 'sb2', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/C73Y3EsJWIk/3...,https://i.ytimg.com/vi_webp/C73Y3EsJWIk/maxres...,Top 10 Best Squat Exercises:\n\nHigh Bar Squat...,UCKf0UqBiCQI4Ol0To9V0pKQ,https://www.youtube.com/channel/UCKf0UqBiCQI4O...,479,421253,...,1118.83,,1.78,mp4a.40.2,129.483,44100,2,squat variations,,[[squat variations]]
EbOPpWi4L8s,EbOPpWi4L8s,How to Do Squats for Beginners,"[{'format_id': 'sb3', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/EbOPpWi4L8s/3...,https://i.ytimg.com/vi/EbOPpWi4L8s/maxresdefau...,How to Do Squats for Beginners. Part of the se...,UCE8wCVw_ZfRw-D6RJ5EXWbw,https://www.youtube.com/channel/UCE8wCVw_ZfRw-...,97,207315,...,415.467,,1.78,opus,106.487,48000,2,squat exercises for beginners,,[[squat exercises for beginners]]
EzvnMZuxGWw,EzvnMZuxGWw,Perfect Squat Form in 3 Steps!,"[{'format_id': 'sb2', 'format_note': 'storyboa...",[{'url': 'https://i.ytimg.com/vi/EzvnMZuxGWw/3...,https://i.ytimg.com/vi/EzvnMZuxGWw/maxresdefau...,,UCyPYQTT20IgzVw92LDvtClw,https://www.youtube.com/channel/UCyPYQTT20IgzV...,60,569406,...,1299.342,,0.56,opus,127.322,48000,2,proper squat form,,"[[proper squat form], [squat variations]]"


In [13]:
# 5 queries x 10 videos per query = 50 search results, of which 28 are unique videos
len(df)

28
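
The metadata columns in the table above (for example `duration`, `view_count` and `aspect_ratio`) can be used to filter videos before downloading them. The sketch below shows one way to do this with pandas. The helper `filter_videos` and its thresholds are assumptions chosen for illustration and are not part of the SDK; the sample values are copied from the five rows displayed above.

```python
import pandas as pd

# Values copied from the first five rows of the metadata table.
sample = pd.DataFrame(
    {
        "id": ["4KmY44Xsg2w", "AIZ8q1qruKw", "C73Y3EsJWIk", "EbOPpWi4L8s", "EzvnMZuxGWw"],
        "duration": [173, 59, 479, 97, 60],
        "view_count": [223954, 1274974, 421253, 207315, 569406],
        "aspect_ratio": [1.78, 0.56, 1.78, 1.78, 0.56],
    }
)


def filter_videos(df, max_duration=300, min_aspect_ratio=1.0):
    """Keep landscape videos no longer than max_duration seconds.

    Hypothetical helper; the thresholds are arbitrary examples.
    """
    mask = (df["duration"] <= max_duration) & (df["aspect_ratio"] >= min_aspect_ratio)
    return df[mask]


kept = filter_videos(sample)
print(list(kept["id"]))
```

On the real data the same function could be applied to `df` before passing `df['id']` to the download step.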

In [10]:
df.to_csv(config.data_dir / 'video_info.csv')

## Download videos and autogenerated subtitles

You can change the subtitle languages, formats, etc. with the `yt_dlp_opts` dictionary (see https://github.com/yt-dlp/yt-dlp for the available options).
The SDK currently expects `.mp4` video files, so do not change the video format.
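
As a rough illustration, an options dictionary along the following lines would request English subtitles in VTT format together with an `.mp4` video. The keys are standard `yt_dlp.YoutubeDL` options, but the specific values, the `data_dir` path, and the way the SDK actually reads `yt_dlp_opts` (for example from `config.yaml`) are assumptions here, so check the SDK configuration before relying on it.

```python
from pathlib import Path

# Assumed data directory; it mirrors the paths that appear in the download log.
data_dir = Path("tmp/squats")

# Sketch of a yt-dlp options dictionary (values are illustrative assumptions).
yt_dlp_opts = {
    "format": "best[ext=mp4]",      # keep .mp4, as the SDK expects
    "writesubtitles": True,         # download uploaded subtitles if present
    "writeautomaticsub": True,      # also accept autogenerated subtitles
    "subtitleslangs": ["en"],       # change this to request other languages
    "subtitlesformat": "vtt",
    "outtmpl": "%(id)s.%(ext)s",    # name files by video id
    "paths": {
        "home": str(data_dir / "videos"),
        "subtitle": str(data_dir / "subs"),
    },
}

print(yt_dlp_opts["subtitleslangs"])
```

To fetch subtitles in another language, only `subtitleslangs` would need to change; the `format` entry should stay on an `.mp4` selector.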

In [15]:
from datagen.download_videos import download_videos
download_videos(df['id'], config)

[youtube] Extracting URL: https://www.youtube.com/watch?v=4KmY44Xsg2w
[youtube] 4KmY44Xsg2w: Downloading webpage
[youtube] 4KmY44Xsg2w: Downloading ios player API JSON
[youtube] 4KmY44Xsg2w: Downloading android player API JSON
[youtube] 4KmY44Xsg2w: Downloading m3u8 information




[info] 4KmY44Xsg2w: Downloading subtitles: en
[info] 4KmY44Xsg2w: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/4KmY44Xsg2w.en.vtt
[download] Destination: tmp/squats/videos/4KmY44Xsg2w.en.vtt
[download] 100% of   26.61KiB in 00:00:00 at 197.80KiB/s
[download] Destination: tmp/squats/videos/4KmY44Xsg2w.mp4
[download] 100% of    6.36MiB in 00:00:01 at 5.06MiB/s     
[MoveFiles] Moving file "tmp/squats/videos/4KmY44Xsg2w.en.vtt" to "tmp/squats/subs/4KmY44Xsg2w.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=AIZ8q1qruKw
[youtube] AIZ8q1qruKw: Downloading webpage
[youtube] AIZ8q1qruKw: Downloading ios player API JSON
[youtube] AIZ8q1qruKw: Downloading android player API JSON
[youtube] AIZ8q1qruKw: Downloading m3u8 information




[info] AIZ8q1qruKw: Downloading subtitles: en
[info] AIZ8q1qruKw: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/AIZ8q1qruKw.en.vtt
[download] Destination: tmp/squats/videos/AIZ8q1qruKw.en.vtt
[download] 100% of    8.53KiB in 00:00:00 at 30.42KiB/s
[download] Destination: tmp/squats/videos/AIZ8q1qruKw.mp4
[download] 100% of    1.96MiB in 00:00:01 at 1.95MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/AIZ8q1qruKw.en.vtt" to "tmp/squats/subs/AIZ8q1qruKw.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=C73Y3EsJWIk
[youtube] C73Y3EsJWIk: Downloading webpage
[youtube] C73Y3EsJWIk: Downloading ios player API JSON
[youtube] C73Y3EsJWIk: Downloading android player API JSON
[youtube] C73Y3EsJWIk: Downloading m3u8 information




[info] C73Y3EsJWIk: Downloading subtitles: en
[info] C73Y3EsJWIk: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/C73Y3EsJWIk.en.vtt
[download] Destination: tmp/squats/videos/C73Y3EsJWIk.en.vtt
[download] 100% of   87.20KiB in 00:00:00 at 316.26KiB/s
[download] Destination: tmp/squats/videos/C73Y3EsJWIk.mp4
[download] 100% of   22.69MiB in 00:00:09 at 2.39MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/C73Y3EsJWIk.en.vtt" to "tmp/squats/subs/C73Y3EsJWIk.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=EbOPpWi4L8s
[youtube] EbOPpWi4L8s: Downloading webpage
[youtube] EbOPpWi4L8s: Downloading ios player API JSON
[youtube] EbOPpWi4L8s: Downloading android player API JSON
[youtube] EbOPpWi4L8s: Downloading m3u8 information




[info] EbOPpWi4L8s: Downloading subtitles: en
[info] EbOPpWi4L8s: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/EbOPpWi4L8s.en.vtt
[download] Destination: tmp/squats/videos/EbOPpWi4L8s.en.vtt
[download] 100% of   12.98KiB in 00:00:00 at 76.77KiB/s
[download] Destination: tmp/squats/videos/EbOPpWi4L8s.mp4
[download] 100% of    4.81MiB in 00:00:01 at 4.30MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/EbOPpWi4L8s.en.vtt" to "tmp/squats/subs/EbOPpWi4L8s.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=EzvnMZuxGWw
[youtube] EzvnMZuxGWw: Downloading webpage
[youtube] EzvnMZuxGWw: Downloading ios player API JSON
[youtube] EzvnMZuxGWw: Downloading android player API JSON
[youtube] EzvnMZuxGWw: Downloading m3u8 information




[info] EzvnMZuxGWw: Downloading subtitles: en
[info] EzvnMZuxGWw: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/EzvnMZuxGWw.en.vtt
[download] Destination: tmp/squats/videos/EzvnMZuxGWw.en.vtt
[download] 100% of    8.85KiB in 00:00:00 at 48.76KiB/s
[download] Destination: tmp/squats/videos/EzvnMZuxGWw.mp4
[download] 100% of    2.86MiB in 00:00:01 at 2.29MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/EzvnMZuxGWw.en.vtt" to "tmp/squats/subs/EzvnMZuxGWw.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=HZilSL4ZNvQ
[youtube] HZilSL4ZNvQ: Downloading webpage
[youtube] HZilSL4ZNvQ: Downloading ios player API JSON
[youtube] HZilSL4ZNvQ: Downloading android player API JSON
[youtube] HZilSL4ZNvQ: Downloading m3u8 information




[info] HZilSL4ZNvQ: Downloading subtitles: en
[info] HZilSL4ZNvQ: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/HZilSL4ZNvQ.en.vtt
[download] Destination: tmp/squats/videos/HZilSL4ZNvQ.en.vtt
[download] 100% of    6.07KiB in 00:00:00 at 35.87KiB/s
[download] Destination: tmp/squats/videos/HZilSL4ZNvQ.mp4
[download] 100% of    1.19MiB in 00:00:00 at 1.35MiB/s     
[MoveFiles] Moving file "tmp/squats/videos/HZilSL4ZNvQ.en.vtt" to "tmp/squats/subs/HZilSL4ZNvQ.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=HgDZlNQrifY
[youtube] HgDZlNQrifY: Downloading webpage
[youtube] HgDZlNQrifY: Downloading ios player API JSON
[youtube] HgDZlNQrifY: Downloading android player API JSON
[youtube] HgDZlNQrifY: Downloading m3u8 information




[info] HgDZlNQrifY: Downloading subtitles: en
[info] HgDZlNQrifY: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/HgDZlNQrifY.en.vtt
[download] Destination: tmp/squats/videos/HgDZlNQrifY.en.vtt
[download] 100% of    5.28KiB in 00:00:00 at 28.22KiB/s
[download] Destination: tmp/squats/videos/HgDZlNQrifY.mp4
[download] 100% of  834.29KiB in 00:00:01 at 817.39KiB/s   
[MoveFiles] Moving file "tmp/squats/videos/HgDZlNQrifY.en.vtt" to "tmp/squats/subs/HgDZlNQrifY.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=IB_icWRzi4E
[youtube] IB_icWRzi4E: Downloading webpage
[youtube] IB_icWRzi4E: Downloading ios player API JSON
[youtube] IB_icWRzi4E: Downloading android player API JSON
[youtube] IB_icWRzi4E: Downloading m3u8 information




[info] IB_icWRzi4E: Downloading subtitles: en
[info] IB_icWRzi4E: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/IB_icWRzi4E.en.vtt
[download] Destination: tmp/squats/videos/IB_icWRzi4E.en.vtt
[download] 100% of   24.24KiB in 00:00:00 at 173.82KiB/s
[download] Destination: tmp/squats/videos/IB_icWRzi4E.mp4
[download] 100% of    4.95MiB in 00:00:01 at 4.16MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/IB_icWRzi4E.en.vtt" to "tmp/squats/subs/IB_icWRzi4E.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=LF4zb2SYWjQ
[youtube] LF4zb2SYWjQ: Downloading webpage
[youtube] LF4zb2SYWjQ: Downloading ios player API JSON
[youtube] LF4zb2SYWjQ: Downloading android player API JSON
[youtube] LF4zb2SYWjQ: Downloading m3u8 information




[info] LF4zb2SYWjQ: Downloading subtitles: en
[info] LF4zb2SYWjQ: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/LF4zb2SYWjQ.en.vtt
[download] Destination: tmp/squats/videos/LF4zb2SYWjQ.en.vtt
[download] 100% of    4.43KiB in 00:00:00 at 28.41KiB/s
[download] Destination: tmp/squats/videos/LF4zb2SYWjQ.mp4
[download] 100% of  778.67KiB in 00:00:01 at 706.89KiB/s   
[MoveFiles] Moving file "tmp/squats/videos/LF4zb2SYWjQ.en.vtt" to "tmp/squats/subs/LF4zb2SYWjQ.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=LFkinX12jtU
[youtube] LFkinX12jtU: Downloading webpage
[youtube] LFkinX12jtU: Downloading ios player API JSON
[youtube] LFkinX12jtU: Downloading android player API JSON
[youtube] LFkinX12jtU: Downloading m3u8 information




[info] LFkinX12jtU: Downloading subtitles: en
[info] LFkinX12jtU: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/LFkinX12jtU.en.vtt
[download] Destination: tmp/squats/videos/LFkinX12jtU.en.vtt
[download] 100% of    9.46KiB in 00:00:00 at 60.96KiB/s
[download] Destination: tmp/squats/videos/LFkinX12jtU.mp4
[download] 100% of    4.11MiB in 00:00:01 at 3.20MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/LFkinX12jtU.en.vtt" to "tmp/squats/subs/LFkinX12jtU.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=MLoZuAkIyZI
[youtube] MLoZuAkIyZI: Downloading webpage
[youtube] MLoZuAkIyZI: Downloading ios player API JSON
[youtube] MLoZuAkIyZI: Downloading android player API JSON
[youtube] MLoZuAkIyZI: Downloading m3u8 information




[info] MLoZuAkIyZI: Downloading subtitles: en
[info] MLoZuAkIyZI: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/MLoZuAkIyZI.en.vtt
[download] Destination: tmp/squats/videos/MLoZuAkIyZI.en.vtt
[download] 100% of   11.28KiB in 00:00:00 at 63.24KiB/s
[download] Destination: tmp/squats/videos/MLoZuAkIyZI.mp4
[download] 100% of    2.78MiB in 00:00:01 at 2.07MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/MLoZuAkIyZI.en.vtt" to "tmp/squats/subs/MLoZuAkIyZI.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=MM9ObaAPcv4
[youtube] MM9ObaAPcv4: Downloading webpage
[youtube] MM9ObaAPcv4: Downloading ios player API JSON
[youtube] MM9ObaAPcv4: Downloading android player API JSON
[youtube] MM9ObaAPcv4: Downloading m3u8 information




[info] MM9ObaAPcv4: Downloading subtitles: en
[info] MM9ObaAPcv4: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/MM9ObaAPcv4.en.vtt
[download] Destination: tmp/squats/videos/MM9ObaAPcv4.en.vtt
[download] 100% of    9.94KiB in 00:00:00 at 54.56KiB/s
[download] Destination: tmp/squats/videos/MM9ObaAPcv4.mp4
[download] 100% of    4.20MiB in 00:00:01 at 3.41MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/MM9ObaAPcv4.en.vtt" to "tmp/squats/subs/MM9ObaAPcv4.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=OTyb4YUDYYY
[youtube] OTyb4YUDYYY: Downloading webpage
[youtube] OTyb4YUDYYY: Downloading ios player API JSON
[youtube] OTyb4YUDYYY: Downloading android player API JSON
[youtube] OTyb4YUDYYY: Downloading m3u8 information




[info] OTyb4YUDYYY: Downloading subtitles: en
[info] OTyb4YUDYYY: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/OTyb4YUDYYY.en.vtt
[download] Destination: tmp/squats/videos/OTyb4YUDYYY.en.vtt
[download] 100% of    620.00B in 00:00:00 at 2.91KiB/s
[download] Destination: tmp/squats/videos/OTyb4YUDYYY.mp4
[download] 100% of    1.74MiB in 00:00:01 at 1.56MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/OTyb4YUDYYY.en.vtt" to "tmp/squats/subs/OTyb4YUDYYY.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=PJj5shV4uYo
[youtube] PJj5shV4uYo: Downloading webpage
[youtube] PJj5shV4uYo: Downloading ios player API JSON
[youtube] PJj5shV4uYo: Downloading android player API JSON
[youtube] PJj5shV4uYo: Downloading m3u8 information




[info] PJj5shV4uYo: Downloading subtitles: en
[info] PJj5shV4uYo: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/PJj5shV4uYo.en.vtt
[download] Destination: tmp/squats/videos/PJj5shV4uYo.en.vtt
[download] 100% of    656.00B in 00:00:00 at 3.61KiB/s
[download] Destination: tmp/squats/videos/PJj5shV4uYo.mp4
[download] 100% of    4.07MiB in 00:00:01 at 3.51MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/PJj5shV4uYo.en.vtt" to "tmp/squats/subs/PJj5shV4uYo.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=PPmvh7gBTi0
[youtube] PPmvh7gBTi0: Downloading webpage
[youtube] PPmvh7gBTi0: Downloading ios player API JSON
[youtube] PPmvh7gBTi0: Downloading android player API JSON
[youtube] PPmvh7gBTi0: Downloading m3u8 information




[info] PPmvh7gBTi0: Downloading subtitles: en
[info] PPmvh7gBTi0: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/PPmvh7gBTi0.en.vtt
[download] Destination: tmp/squats/videos/PPmvh7gBTi0.en.vtt
[download] 100% of   10.23KiB in 00:00:00 at 71.62KiB/s
[download] Destination: tmp/squats/videos/PPmvh7gBTi0.mp4
[download] 100% of    3.71MiB in 00:00:01 at 3.59MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/PPmvh7gBTi0.en.vtt" to "tmp/squats/subs/PPmvh7gBTi0.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=SLOkdLLWj8A
[youtube] SLOkdLLWj8A: Downloading webpage
[youtube] SLOkdLLWj8A: Downloading ios player API JSON
[youtube] SLOkdLLWj8A: Downloading android player API JSON
[youtube] SLOkdLLWj8A: Downloading m3u8 information




[info] SLOkdLLWj8A: Downloading 1 format(s): 18
[info] There are no subtitles for the requested languages
[download] Destination: tmp/squats/videos/SLOkdLLWj8A.mp4
[download] 100% of  627.24KiB in 00:00:00 at 788.73KiB/s 
[youtube] Extracting URL: https://www.youtube.com/watch?v=T6id8FuUcao
[youtube] T6id8FuUcao: Downloading webpage
[youtube] T6id8FuUcao: Downloading ios player API JSON
[youtube] T6id8FuUcao: Downloading android player API JSON
[youtube] T6id8FuUcao: Downloading m3u8 information




[info] T6id8FuUcao: Downloading 1 format(s): 18
[info] There are no subtitles for the requested languages
[download] Destination: tmp/squats/videos/T6id8FuUcao.mp4
[download] 100% of  683.30KiB in 00:00:00 at 814.87KiB/s   
[youtube] Extracting URL: https://www.youtube.com/watch?v=TH6jSCGnowI
[youtube] TH6jSCGnowI: Downloading webpage
[youtube] TH6jSCGnowI: Downloading ios player API JSON
[youtube] TH6jSCGnowI: Downloading android player API JSON
[youtube] TH6jSCGnowI: Downloading m3u8 information




[info] TH6jSCGnowI: Downloading subtitles: en
[info] TH6jSCGnowI: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/TH6jSCGnowI.en.vtt
[download] Destination: tmp/squats/videos/TH6jSCGnowI.en.vtt
[download] 100% of    6.34KiB in 00:00:00 at 33.87KiB/s
[download] Destination: tmp/squats/videos/TH6jSCGnowI.mp4
[download] 100% of    1.19MiB in 00:00:00 at 1.35MiB/s     
[MoveFiles] Moving file "tmp/squats/videos/TH6jSCGnowI.en.vtt" to "tmp/squats/subs/TH6jSCGnowI.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=a3aw-5vDM2E
[youtube] a3aw-5vDM2E: Downloading webpage
[youtube] a3aw-5vDM2E: Downloading ios player API JSON
[youtube] a3aw-5vDM2E: Downloading android player API JSON
[youtube] a3aw-5vDM2E: Downloading m3u8 information




[info] a3aw-5vDM2E: Downloading 1 format(s): 18
[info] There are no subtitles for the requested languages
[download] Destination: tmp/squats/videos/a3aw-5vDM2E.mp4
[download] 100% of    1.74MiB in 00:00:00 at 2.04MiB/s   
[youtube] Extracting URL: https://www.youtube.com/watch?v=byxWus7BwfQ
[youtube] byxWus7BwfQ: Downloading webpage
[youtube] byxWus7BwfQ: Downloading ios player API JSON
[youtube] byxWus7BwfQ: Downloading android player API JSON
[youtube] byxWus7BwfQ: Downloading m3u8 information




[info] byxWus7BwfQ: Downloading subtitles: en
[info] byxWus7BwfQ: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/byxWus7BwfQ.en.vtt
[download] Destination: tmp/squats/videos/byxWus7BwfQ.en.vtt
[download] 100% of   33.97KiB in 00:00:00 at 185.78KiB/s
[download] Destination: tmp/squats/videos/byxWus7BwfQ.mp4
[download] 100% of   11.12MiB in 00:00:02 at 4.27MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/byxWus7BwfQ.en.vtt" to "tmp/squats/subs/byxWus7BwfQ.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=cRxg-PUAT6I
[youtube] cRxg-PUAT6I: Downloading webpage
[youtube] cRxg-PUAT6I: Downloading ios player API JSON
[youtube] cRxg-PUAT6I: Downloading android player API JSON
[youtube] cRxg-PUAT6I: Downloading m3u8 information




[info] cRxg-PUAT6I: Downloading 1 format(s): 18
[info] There are no subtitles for the requested languages
[download] Destination: tmp/squats/videos/cRxg-PUAT6I.mp4
[download] 100% of  503.08KiB in 00:00:01 at 317.08KiB/s 
[youtube] Extracting URL: https://www.youtube.com/watch?v=gslEzVggur8
[youtube] gslEzVggur8: Downloading webpage
[youtube] gslEzVggur8: Downloading ios player API JSON
[youtube] gslEzVggur8: Downloading android player API JSON
[youtube] gslEzVggur8: Downloading m3u8 information




[info] gslEzVggur8: Downloading 1 format(s): 18
[info] There are no subtitles for the requested languages
[download] Destination: tmp/squats/videos/gslEzVggur8.mp4
[download] 100% of    1.53MiB in 00:00:01 at 813.11KiB/s   
[youtube] Extracting URL: https://www.youtube.com/watch?v=iZTxa8NJH2g
[youtube] iZTxa8NJH2g: Downloading webpage
[youtube] iZTxa8NJH2g: Downloading ios player API JSON
[youtube] iZTxa8NJH2g: Downloading android player API JSON
[youtube] iZTxa8NJH2g: Downloading m3u8 information




[info] iZTxa8NJH2g: Downloading subtitles: en
[info] iZTxa8NJH2g: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/iZTxa8NJH2g.en.vtt
[download] Destination: tmp/squats/videos/iZTxa8NJH2g.en.vtt
[download] 100% of    8.98KiB in 00:00:00 at 49.91KiB/s
[download] Destination: tmp/squats/videos/iZTxa8NJH2g.mp4
[download] 100% of    2.45MiB in 00:00:02 at 894.01KiB/s 
[MoveFiles] Moving file "tmp/squats/videos/iZTxa8NJH2g.en.vtt" to "tmp/squats/subs/iZTxa8NJH2g.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=jhb_nnV29EU
[youtube] jhb_nnV29EU: Downloading webpage
[youtube] jhb_nnV29EU: Downloading ios player API JSON
[youtube] jhb_nnV29EU: Downloading android player API JSON
[youtube] jhb_nnV29EU: Downloading m3u8 information




[info] jhb_nnV29EU: Downloading subtitles: en
[info] jhb_nnV29EU: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/jhb_nnV29EU.en.vtt
[download] Destination: tmp/squats/videos/jhb_nnV29EU.en.vtt
[download] 100% of    2.46KiB in 00:00:00 at 7.82KiB/s
[download] Destination: tmp/squats/videos/jhb_nnV29EU.mp4
[download] 100% of    2.09MiB in 00:00:01 at 1.49MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/jhb_nnV29EU.en.vtt" to "tmp/squats/subs/jhb_nnV29EU.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=my0tLDaWyDU
[youtube] my0tLDaWyDU: Downloading webpage
[youtube] my0tLDaWyDU: Downloading ios player API JSON
[youtube] my0tLDaWyDU: Downloading android player API JSON
[youtube] my0tLDaWyDU: Downloading m3u8 information




[info] my0tLDaWyDU: Downloading subtitles: en
[info] my0tLDaWyDU: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/my0tLDaWyDU.en.vtt
[download] Destination: tmp/squats/videos/my0tLDaWyDU.en.vtt
[download] 100% of   70.79KiB in 00:00:00 at 326.16KiB/s
[download] Destination: tmp/squats/videos/my0tLDaWyDU.mp4
[download] 100% of   16.91MiB in 00:00:03 at 5.60MiB/s     
[MoveFiles] Moving file "tmp/squats/videos/my0tLDaWyDU.en.vtt" to "tmp/squats/subs/my0tLDaWyDU.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=rhbIFJj4UYc
[youtube] rhbIFJj4UYc: Downloading webpage
[youtube] rhbIFJj4UYc: Downloading ios player API JSON
[youtube] rhbIFJj4UYc: Downloading android player API JSON
[youtube] rhbIFJj4UYc: Downloading m3u8 information




[info] rhbIFJj4UYc: Downloading subtitles: en
[info] rhbIFJj4UYc: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/rhbIFJj4UYc.en.vtt
[download] Destination: tmp/squats/videos/rhbIFJj4UYc.en.vtt
[download] 100% of   10.13KiB in 00:00:00 at 57.99KiB/s
[download] Destination: tmp/squats/videos/rhbIFJj4UYc.mp4
[download] 100% of    4.26MiB in 00:00:02 at 1.64MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/rhbIFJj4UYc.en.vtt" to "tmp/squats/subs/rhbIFJj4UYc.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=w8ZhgecdIAM
[youtube] w8ZhgecdIAM: Downloading webpage
[youtube] w8ZhgecdIAM: Downloading ios player API JSON
[youtube] w8ZhgecdIAM: Downloading android player API JSON
[youtube] w8ZhgecdIAM: Downloading m3u8 information




[info] w8ZhgecdIAM: Downloading subtitles: en
[info] w8ZhgecdIAM: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/w8ZhgecdIAM.en.vtt
[download] Destination: tmp/squats/videos/w8ZhgecdIAM.en.vtt
[download] 100% of    497.00B in 00:00:00 at 3.14KiB/s
[download] Destination: tmp/squats/videos/w8ZhgecdIAM.mp4
[download] 100% of    1.86MiB in 00:00:01 at 1.63MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/w8ZhgecdIAM.en.vtt" to "tmp/squats/subs/w8ZhgecdIAM.en.vtt"
[youtube] Extracting URL: https://www.youtube.com/watch?v=xawAf5fXD2c
[youtube] xawAf5fXD2c: Downloading webpage
[youtube] xawAf5fXD2c: Downloading ios player API JSON
[youtube] xawAf5fXD2c: Downloading android player API JSON
[youtube] xawAf5fXD2c: Downloading m3u8 information




[info] xawAf5fXD2c: Downloading subtitles: en
[info] xawAf5fXD2c: Downloading 1 format(s): 18
[info] Writing video subtitles to: tmp/squats/videos/xawAf5fXD2c.en.vtt
[download] Destination: tmp/squats/videos/xawAf5fXD2c.en.vtt
[download] 100% of    999.00B in 00:00:00 at 5.77KiB/s
[download] Destination: tmp/squats/videos/xawAf5fXD2c.mp4
[download] 100% of    2.05MiB in 00:00:01 at 1.77MiB/s   
[MoveFiles] Moving file "tmp/squats/videos/xawAf5fXD2c.en.vtt" to "tmp/squats/subs/xawAf5fXD2c.en.vtt"


## Detect segments from video and analyze them with GPT-4o

In [5]:
from datagen.detect_segments import detect_segments

from typing import Optional
from langchain.pydantic_v1 import BaseModel, Field

# This is the schema we will extract from each detected segment; it will also be used during annotation.
# If you want the annotations to focus on the transcript, do not extract too much visual information here, because it might distract the LLM during annotation.
# We will use "doing_squats" for filtering
# and "overlay_text" for annotation, although the latter might add noise.

class SegmentInfo(BaseModel):
    '''Information about a segment'''
    doing_squats: bool = Field(description='Whether the person is doing squats. Only consider video of people, not renders or cartoons. If a person looks like they are preparing to do squats or standing between reps, consider them also doing squats if they are in a gym setting, wearing sportswear etc.')
    overlay_text: str = Field(description='Overlay text that is superimposed over the image, if present.')
    # clothes: str = Field(description='Clothes of the person doing squats in detail.')
    # image_description: str = Field(description="Describe the image in detail")


segments = []
for video in config.get_videos():
    print(video.stem)
    segments.append(detect_segments(
        video_id=video.stem,
        segment_info_schema=SegmentInfo,
        detection_algorithm=None,  # the default AdaptiveDetector algorithm is good for most types of video: https://www.scenedetect.com/docs/latest/api/detectors.html
        min_duration=1, max_duration=60,  # discard segments that are too short or too long to save some GPT calls
        frames_per_segment=1,  # how many frames per segment we use for detection. More frames are more accurate and capture more information (e.g. changing overlay text), but are also slower and more expensive.
        config=config))


byxWus7BwfQ
OTyb4YUDYYY
w8ZhgecdIAM
T6id8FuUcao
a3aw-5vDM2E
AIZ8q1qruKw
C73Y3EsJWIk
EbOPpWi4L8s
SLOkdLLWj8A
EzvnMZuxGWw
PJj5shV4uYo
MM9ObaAPcv4
HgDZlNQrifY
LFkinX12jtU
LF4zb2SYWjQ
xawAf5fXD2c
MLoZuAkIyZI
TH6jSCGnowI
HZilSL4ZNvQ
gslEzVggur8
cRxg-PUAT6I
IB_icWRzi4E
my0tLDaWyDU
4KmY44Xsg2w
rhbIFJj4UYc
jhb_nnV29EU
iZTxa8NJH2g
PPmvh7gBTi0


In [7]:
# 28 videos took 14.5 minutes with 1 frame per segment,
# i.e. about 30 seconds per video
!ls tmp/squats/segments | wc -l

      28


## Annotate the segments from the transcript + additional info

In [8]:
segments = config.get_segments(info_type=SegmentInfo)
print(len(segments))
# we're only interested in segments where people do squats
segments = [x for x in segments if x.segment_info['doing_squats']]
print(len(segments))
segments[0]

357
227


Segment(start_timestamp='00:00:00.000', end_timestamp='00:00:02.800', fps=30.0, segment_info={'doing_squats': True, 'overlay_text': 'TOP THREE ONE LEGGED SQUATS'}, video_id='a3aw-5vDM2E')
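
The `min_duration`/`max_duration` limits used during detection are expressed in seconds, while the segments store their boundaries as timestamp strings. The sketch below shows one way to compute a segment's duration from those strings. It assumes the `HH:MM:SS.mmm` format seen in the printed `Segment` above; `parse_timestamp` and `segment_duration` are hypothetical helpers written for illustration, not part of `datagen`.

```python
from datetime import timedelta


def parse_timestamp(ts: str) -> timedelta:
    """Parse an 'HH:MM:SS.mmm' timestamp string into a timedelta."""
    hours, minutes, seconds = ts.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def segment_duration(start_timestamp: str, end_timestamp: str) -> float:
    """Return the duration of a segment in seconds."""
    delta = parse_timestamp(end_timestamp) - parse_timestamp(start_timestamp)
    return delta.total_seconds()


# Timestamps copied from the first segment printed above.
duration = segment_duration("00:00:00.000", "00:00:02.800")
print(round(duration, 3))
```

A helper like this can be handy for sanity-checking how many of the remaining segments are very short before spending annotation calls on them.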

In [9]:
from datagen.annotate import generate_annotations
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

# This schema will be extracted for each segment.
# It is the most important part of the annotation, and getting good results requires a lot of experimentation.

class QA(BaseModel):
    '''
    Question and answer about a video segment.
    Only write questions and answers about the correctness of the exercises or in which ways the performance in the video was wrong.
    '''
    question: str = Field(description='Question about the exercise performance in the video')
    answer: str = Field(description='Answer about the exercise performance from a trainer.')
    quote: str = Field(description='A direct and explicit quote from transcript or on-screen-text. The answer must be directly inferred from this quote.')

    # Valid QA:
    # - Was the exercise performed correctly? -No.
    # - How can I improve my form? -Set your knees wider.
    # Invalid QA:
    # - Was the exercise performed correctly? -There is no information about that.
    # - What is written on screen? - The word Hello is written on screen.

class SegmentAnnotation(BaseModel):
    '''
    If there is information about whether the exercise was performed correctly or not, make a QA about it.
    If the exercise was performed incorrectly, make one or more QA about in which ways it was performed incorrectly.
    If it's not possible to infer whether the exercise was performed correctly, do not output a segment annotation.
    Do not output any other kinds of questions and answers.
    If no possible such QA could be generated from the explicit information in the transcript or on-screen text, do not output annotation for this segment.
    Output at most one annotation per segment.
    '''
    correct: Optional[bool] = Field(description='Whether the exercise was performed correctly. If there is no information about that, do not output this field.')
    incorrect_reasons: Optional[str] = Field(description='If the exercise was performed incorrectly, the reasons that were given for why the performance was incorrect. If there is no information about that, do not output this field.')
    qa: list[QA]

annotations = generate_annotations(segments=segments, config=config, annotation_schema=SegmentAnnotation)


In [34]:
from datagen.annotate import aggregate_annotations

# Keep only the segments that have correctness information and at least one QA
filter_func = lambda seg: seg['correct'] is not None and len(seg['qa']) > 0
annotations = aggregate_annotations(config, filter_func=filter_func)
print('Total segments:', len(annotations))
annotations[0]

skipping AIZ8q1qruKw
skipping iZTxa8NJH2g
Total segments: 25


{'start_timestamp': '00:00:26.933',
 'end_timestamp': '00:00:28.467',
 'segment_annotation': {'correct': False,
  'incorrect_reasons': 'Incorrect stance',
  'qa': [{'question': 'Was the exercise performed correctly?',
    'answer': 'No',
    'quote': 'X, ✓'}]},
 'video_id': 'OTyb4YUDYYY',
 'id': 'OTyb4YUDYYY_0',
 'video_path': 'OTyb4YUDYYY_0.mp4'}
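
The filter used above can also be written as a named function, which is easier to test and to extend. This is only a sketch: it assumes the filter receives a dict with the `correct` and `qa` keys of `SegmentAnnotation`, as the lambda does, and `has_correctness_and_qa` is a hypothetical name. Unlike the lambda, it tolerates missing keys.

```python
from typing import Any


def has_correctness_and_qa(seg: dict[str, Any]) -> bool:
    """Return True if the segment states correctness and has at least one QA."""
    has_correctness = seg.get("correct") is not None
    has_qa = len(seg.get("qa") or []) > 0
    return has_correctness and has_qa


# Hand-written example records; only the first one mirrors the output above.
examples = [
    {"correct": False, "incorrect_reasons": "Incorrect stance",
     "qa": [{"question": "Was the exercise performed correctly?", "answer": "No", "quote": "X, ✓"}]},
    {"correct": None, "qa": []},   # no correctness information
    {"correct": True, "qa": []},   # correctness but no QA
]

kept = [seg for seg in examples if has_correctness_and_qa(seg)]
print(len(kept))
```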

In [35]:
import json
with open(config.data_dir / 'annotations.json', 'w') as f:
    json.dump(annotations, f)
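
If you plan to inspect the file by hand, it may help to write it with `indent` and `ensure_ascii=False`, so that characters such as `✓` in quotes stay readable instead of being escaped. The sketch below round-trips a shortened, hand-written example record through a temporary directory; it does not use `config.data_dir`.

```python
import json
import tempfile
from pathlib import Path

# Shortened example record, modelled on the annotation printed above.
annotations = [{
    "id": "OTyb4YUDYYY_0",
    "segment_annotation": {
        "correct": False,
        "qa": [{"question": "Was the exercise performed correctly?",
                "answer": "No",
                "quote": "X, ✓"}],
    },
}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "annotations.json"
    # ensure_ascii=False keeps non-ASCII characters as-is; indent makes the file readable.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(annotations, f, ensure_ascii=False, indent=2)
    raw_text = path.read_text(encoding="utf-8")
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == annotations)
```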

## Finally, cut video clips for these segments from the original videos

In [32]:
from datagen.cut_videos import cut_videos
cut_videos(annotations, config=config)

100%|██████████| 25/25 [00:22<00:00,  1.13it/s]
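
For intuition, cutting a clip for one annotation could be done with an `ffmpeg` call like the one built below. This is a sketch and not necessarily how `cut_videos` works internally. It assumes the source videos live in `<data_dir>/videos/<video_id>.mp4`, as the download log suggests, and that clips go to `<data_dir>/clips/`; `build_cut_command` is a hypothetical helper, and `tmp/squats` is taken from the paths in the log.

```python
from pathlib import Path


def build_cut_command(annotation: dict, data_dir: Path) -> list[str]:
    """Build an ffmpeg command that cuts one clip out of its source video."""
    src = data_dir / "videos" / f"{annotation['video_id']}.mp4"
    dst = data_dir / "clips" / annotation["video_path"]
    return [
        "ffmpeg", "-y",                         # overwrite the output if it exists
        "-i", str(src),                         # input video
        "-ss", annotation["start_timestamp"],   # clip start
        "-to", annotation["end_timestamp"],     # clip end
        str(dst),                               # no "-c copy", so the clip is re-encoded
    ]


annotation = {
    "video_id": "OTyb4YUDYYY",
    "start_timestamp": "00:00:26.933",
    "end_timestamp": "00:00:28.467",
    "video_path": "OTyb4YUDYYY_0.mp4",
}
cmd = build_cut_command(annotation, Path("tmp/squats"))
print(" ".join(cmd))
```

The command is only built here, not executed; passing it to `subprocess.run(cmd, check=True)` would run it if `ffmpeg` is installed. Re-encoding is slower than stream copying but keeps the cut points accurate for very short clips.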


As a result, we generated:
- `<data_dir>/clips/` with video clips that you can use for training
- `<data_dir>/annotations.json` with a list of items with the following fields:
    - `video_id`: 11-character YouTube video id (youtube.com/watch?v=<id>)
    - `start_timestamp`/`end_timestamp`: start and end of the clip relative to the YouTube video it is taken from
    - `video_path`: path of the clip relative to `<data_dir>/clips/`
    - `segment_annotation`: the annotation that you can use for training
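
As a rough illustration of how these fields could feed a training pipeline, the sketch below flattens each annotation into one sample per question-answer pair. The sample layout (`video_path`, `question`, `answer`) and the `to_training_samples` helper are assumptions made for this example; the format you need depends on your training code.

```python
def to_training_samples(annotations: list[dict]) -> list[dict]:
    """Flatten annotations into one training sample per QA pair."""
    samples = []
    for item in annotations:
        for qa in item["segment_annotation"]["qa"]:
            samples.append({
                "video_path": item["video_path"],  # relative to <data_dir>/clips/
                "question": qa["question"],
                "answer": qa["answer"],
            })
    return samples


# Example record copied from the annotation printed earlier.
annotations = [{
    "start_timestamp": "00:00:26.933",
    "end_timestamp": "00:00:28.467",
    "segment_annotation": {
        "correct": False,
        "incorrect_reasons": "Incorrect stance",
        "qa": [{"question": "Was the exercise performed correctly?",
                "answer": "No",
                "quote": "X, ✓"}],
    },
    "video_id": "OTyb4YUDYYY",
    "id": "OTyb4YUDYYY_0",
    "video_path": "OTyb4YUDYYY_0.mp4",
}]

samples = to_training_samples(annotations)
print(samples)
```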