# YouTube links
Scraping the Udacity lesson pages and extracting the YouTube links is left as an exercise for the reader. Here we load the scraped links from a JSON file.

In [1]:
import pandas as pd
ytdf = pd.read_json('..\\Scraper\\youtube3.json')

We use the [`youtube-dl`](https://github.com/ytdl-org/youtube-dl) package to download the subtitles from the videos to vtt files. The relevant options for the downloader utility are as follows:
```
  skip_download = True                             # Prevents downloading of the video itself
  outtmpl = f'ytsubs\\%(title)s_{video_code}.vtt'  # Output file template including folder
  subtitleslangs = ['en']                          # English subtitles (a list of language codes)
  writesubtitles = True                            # Grab regular subtitles
  writeautomaticsub = True                         # Grab autogenerated subtitles
```

In [2]:
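import youtube_dl as ydl

def download_subs(video_code, lang='en', auto=False):
    """Download the subtitles (not the video) for one YouTube video code."""
    root_url = 'https://www.youtube.com/watch?v='
    # youtube-dl appends the language and format, giving names like '<title>_<code>.vtt.en.vtt'
    output = f'ytsubs\\%(title)s_{video_code}.vtt'
    yt_opts = {
        'skip_download': True,       # subtitles only
        'outtmpl': output,
        'subtitleslangs': [lang],    # youtube-dl expects a list under this key
    }
    # Request autogenerated captions only when asked; otherwise regular subtitles
    yt_opts.update({'writeautomaticsub': True} if auto else {'writesubtitles': True})
    with ydl.YoutubeDL(yt_opts) as yt:
        yt.download([root_url + video_code])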

## Add columns
We add a lesson number and a concept number to the `ytdf` dataframe so that each video can be traced back to its source.

In [3]:
# Number the lessons in order of first appearance
ytdf['lesson_id'] = ytdf.lesson_name.map(
    {name: i for i, name in enumerate(ytdf.lesson_name.unique())}
)
# The concept number is the text before the first period in the concept name
ytdf['concept_id'] = ytdf.concept_name.map(lambda x: x[:x.find('.')].strip())
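
One edge case in the `concept_id` expression: `str.find` returns `-1` when the name contains no period, so the slice silently drops the last character instead of failing. Below is a minimal sketch of a more defensive helper. It assumes concept names look like `'12. Some title'`; the sample names are made up for illustration, and `concept_id_from_name` is a hypothetical helper, not part of the original notebook.

```python
def concept_id_from_name(name):
    """Return the text before the first period, or None if there is no period."""
    prefix, sep, _ = name.partition('.')
    # partition returns an empty separator when '.' is absent
    return prefix.strip() if sep else None

print(concept_id_from_name('12. Linear regression'))  # → 12
print(concept_id_from_name('Welcome'))                # → None

# For comparison, the slice-based expression on a name without a period:
print('Welcome'[:'Welcome'.find('.')].strip())        # → Welcom
```

If this behaviour is preferred, it could be applied with `ytdf.concept_name.map(concept_id_from_name)`.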

In [4]:
ytdf.to_pickle('ytdf.pkl')

## Execute the downloads
Each video takes a couple of seconds. On my machine the full run took 2-3 minutes.

In [5]:
for code in ytdf.video_url:
    download_subs(code)

[youtube] 9QadFJRKrEA: Downloading webpage
[info] Writing video subtitles to: ytsubs\Welcome To Udacity V2_9QadFJRKrEA.vtt.en.vtt
[youtube] hwtrw64xQmQ: Downloading webpage




[youtube] C25mkY1R5Wc: Downloading webpage
[info] Writing video subtitles to: ytsubs\0 00 11983 Udacity ML Course Lesson0 FINAL V3_C25mkY1R5Wc.vtt.en.vtt
[youtube] V__T6TEXobA: Downloading webpage
[info] Writing video subtitles to: ytsubs\1 00 11983 Udacity ML Course Lesson1 FINAL V2_V__T6TEXobA.vtt.en.vtt
[youtube] 567-mVblChI: Downloading webpage
[info] Writing video subtitles to: ytsubs\1 01 11983 Udacity ML Course Lesson1 FINAL V2_567-mVblChI.vtt.en.vtt
[youtube] IFqpMGZRaGc: Downloading webpage
[info] Writing video subtitles to: ytsubs\1 03 11983 Udacity ML Course Lesson1 FINAL V2_IFqpMGZRaGc.vtt.en.vtt
[youtube] cy6RinIoteM: Downloading webpage
[info] Writing video subtitles to: ytsubs\1 04 11983 Udacity ML Course Lesson1 FINAL V2_cy6RinIoteM.vtt.en.vtt
[youtube] xX66AYbEJJY: Downloading webpage
[info] Writing video subtitles to: ytsubs\1 05 11983 Udacity ML Course Lesson1 FINAL V2_xX66AYbEJJY.vtt.en.vtt
[youtube] 11Hcp1ts494: Downloading webpage
[info] Writing video subtitles to

[info] Writing video subtitles to: ytsubs\2 17 11983 Udacity ML Course FINAL V2_qpmlvrQWZ6U.vtt.en.vtt
[youtube] c_8sEPf0Cg0: Downloading webpage
[info] Writing video subtitles to: ytsubs\2 18 11983 Udacity ML Course FINAL V2_c_8sEPf0Cg0.vtt.en.vtt
[youtube] sG76ZVt0tS4: Downloading webpage
[info] Writing video subtitles to: ytsubs\2 19 B 11983 Udacity ML Course FINAL V2_sG76ZVt0tS4.vtt.en.vtt
[youtube] rwvFVe3CbXs: Downloading webpage
[info] Writing video subtitles to: ytsubs\2 19 A 11983 Udacity ML Course FINAL V2_rwvFVe3CbXs.vtt.en.vtt
[youtube] cEwPyC6pvAY: Downloading webpage
[info] Writing video subtitles to: ytsubs\2 20 11983 Udacity ML Course FINAL V2_cEwPyC6pvAY.vtt.en.vtt
[youtube] P8XkYitGtXE: Downloading webpage
[info] Writing video subtitles to: ytsubs\2 21 LAB 11983 Udacity ML Course FINAL V2_P8XkYitGtXE.vtt.en.vtt
[youtube] msaXELuThsY: Downloading webpage
[info] Writing video subtitles to: ytsubs\2 25 11983 Udacity ML Course FINAL V2_msaXELuThsY.vtt.en.vtt
[youtube] 40i



[youtube] lXDOH8B0Gs4: Downloading webpage
[info] Writing video subtitles to: ytsubs\3 32 11983 Udacity ML Course Lesson3 FINAL V2_lXDOH8B0Gs4.vtt.en.vtt
[youtube] SR8akpb0zpE: Downloading webpage




[youtube] QewAKJzDAeg: Downloading webpage
[info] Writing video subtitles to: ytsubs\3 33 11983 Udacity ML Course Lesson3 FINAL V2_QewAKJzDAeg.vtt.en.vtt
[youtube] 2PGHJJARRa4: Downloading webpage
[info] Writing video subtitles to: ytsubs\3 34 11983 Udacity ML Course Lesson3 FINAL V2_2PGHJJARRa4.vtt.en.vtt
[youtube] pf9ccUR8fJg: Downloading webpage
[info] Writing video subtitles to: ytsubs\3 08 11983 Udacity ML Course Lesson3 FINAL V2_pf9ccUR8fJg.vtt.en.vtt
[youtube] VfIJGRbKptQ: Downloading webpage
[info] Writing video subtitles to: ytsubs\3 07 11983 Udacity ML Course Lesson3 FINAL V2_VfIJGRbKptQ.vtt.en.vtt
[youtube] fH5GNou7nj4: Downloading webpage
[info] Writing video subtitles to: ytsubs\3 36 11983 Udacity ML Course Lesson3 FINAL V2_fH5GNou7nj4.vtt.en.vtt
[youtube] g1i5ErYFaUw: Downloading webpage
[info] Writing video subtitles to: ytsubs\3 10 11983 Udacity ML Course Lesson3 FINAL V2_g1i5ErYFaUw.vtt.en.vtt
[youtube] XpM7c9AAN3Y: Downloading webpage
[info] Writing video subtitles to

[youtube] wWODHPbb8no: Downloading webpage
[info] Writing video subtitles to: ytsubs\5 11 11983 Udacity ML Course Lesson5 FINAL V2_wWODHPbb8no.vtt.en.vtt
[youtube] dRB4fj3LD4o: Downloading webpage
[info] Writing video subtitles to: ytsubs\5 10 11983 Udacity ML Course Lesson5 FINAL V2_dRB4fj3LD4o.vtt.en.vtt
[youtube] oADXOnCXVNg: Downloading webpage
[info] Writing video subtitles to: ytsubs\5 09 11983 Udacity ML Course Lesson5 FINAL V2_oADXOnCXVNg.vtt.en.vtt
[youtube] PNKjNd0uJSI: Downloading webpage
[info] Writing video subtitles to: ytsubs\5 08 11983 Udacity ML Course Lesson5 FINAL V2_PNKjNd0uJSI.vtt.en.vtt
[youtube] Ar5V4a0OBug: Downloading webpage
[info] Writing video subtitles to: ytsubs\5 07 11983 Udacity ML Course Lesson5 FINAL V2_Ar5V4a0OBug.vtt.en.vtt
[youtube] S0pSSahJJIM: Downloading webpage
[info] Writing video subtitles to: ytsubs\5 Lab03 11983 Udacity ML Course Lesson5 FINAL_S0pSSahJJIM.vtt.en.vtt
[youtube] sDp10jUg9Bk: Downloading webpage
[info] Writing video subtitles to

We requested regular subtitles first. Looking at the output, three videos have no regular subtitles: their `Downloading webpage` line is not followed by a `Writing video subtitles` line. For those we download the autogenerated subtitles instead.

In [6]:
for code in ['hwtrw64xQmQ','HXnLBA4OOsg','SR8akpb0zpE']: download_subs(code, auto=True)

[youtube] hwtrw64xQmQ: Downloading webpage
[youtube] hwtrw64xQmQ: Looking for automatic captions
[info] Writing video subtitles to: ytsubs\Microsoft Azure PROMO (2)_hwtrw64xQmQ.vtt.en.vtt
[youtube] HXnLBA4OOsg: Downloading webpage
[youtube] HXnLBA4OOsg: Looking for automatic captions
[info] Writing video subtitles to: ytsubs\3 01 11983 Udacity ML Course Lesson3 FINAL V2_HXnLBA4OOsg.vtt.en.vtt
[youtube] SR8akpb0zpE: Downloading webpage
[youtube] SR8akpb0zpE: Looking for automatic captions
[info] Writing video subtitles to: ytsubs\3 04 11983 Udacity ML Course Lesson3 FINAL V2_SR8akpb0zpE.vtt.en.vtt
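
The three codes above were picked out of the log by eye. As a rough sketch, the same check could be automated, assuming the log text has been captured into a string (how it is captured is left open here) and follows the two line formats visible in the output. `codes_without_subtitles` is a hypothetical helper, and the sample is a shortened excerpt of the log shown earlier.

```python
import re

def codes_without_subtitles(log_text):
    """Return video codes whose 'Downloading webpage' line has no
    'Writing video subtitles' line before the next video starts."""
    missing = []
    pending = None  # most recent video code with no subtitle line seen yet
    for line in log_text.splitlines():
        started = re.match(r'\[youtube\] (\S+): Downloading webpage', line)
        if started:
            if pending is not None:
                missing.append(pending)
            pending = started.group(1)
        elif line.startswith('[info] Writing video subtitles to'):
            pending = None
    if pending is not None:
        missing.append(pending)
    return missing

sample = "\n".join([
    "[youtube] 9QadFJRKrEA: Downloading webpage",
    "[info] Writing video subtitles to: ytsubs\\Welcome To Udacity V2_9QadFJRKrEA.vtt.en.vtt",
    "[youtube] hwtrw64xQmQ: Downloading webpage",
    "[youtube] C25mkY1R5Wc: Downloading webpage",
    "[info] Writing video subtitles to: ytsubs\\0 00 11983 Udacity ML Course Lesson0 FINAL V3_C25mkY1R5Wc.vtt.en.vtt",
])
print(codes_without_subtitles(sample))  # → ['hwtrw64xQmQ']
```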


Finally, let's check that we have all the files. First, list all the files in the directory.

In [7]:
import os
(_, _, filenames) = next(os.walk('ytsubs'))

Take the set of video codes from the dataframe and subtract the set of video codes found in the filenames (the last 11 characters of the part of each filename before the first period). If we have all the files, the result is an empty set.

In [8]:
set(ytdf.video_url) - {fn.split('.')[0][-11:] for fn in filenames}

set()
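
The check above splits each filename at its first period, so a video title that itself contains a period would yield the wrong 11 characters. A possible alternative, sketched here under the assumption that every file follows the `<title>_<11-character code>.vtt...` pattern seen in the log, is to match the code with a regular expression. `missing_codes` is a hypothetical helper, and the second filename below is invented to show the period case.

```python
import re

# Assumed filename pattern: '<title>_<11-char video code>.vtt.en.vtt'
CODE_PATTERN = re.compile(r'_([A-Za-z0-9_-]{11})\.vtt')

def missing_codes(expected_codes, filenames):
    """Return the expected video codes that have no matching subtitle file."""
    found = set()
    for fn in filenames:
        match = CODE_PATTERN.search(fn)
        if match:
            found.add(match.group(1))
    return set(expected_codes) - found

files = [
    '2 18 11983 Udacity ML Course FINAL V2_c_8sEPf0Cg0.vtt.en.vtt',
    'Intro to A.I._abcdefghijk.vtt.en.vtt',  # invented title containing periods
]
print(missing_codes(['c_8sEPf0Cg0', 'abcdefghijk'], files))  # → set()
```

With the notebook's variables this would be called as `missing_codes(ytdf.video_url, filenames)`.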