# Lecture Note Download Macro

This program automatically downloads every lecture note published for a course on NTU OpenCourseWare (http://ocw.aca.ntu.edu.tw/ntu-ocw/).

## Packages
Several packages are required to run the web crawler.
`selenium` drives a headless browser so that the page's JavaScript runs before the HTML is read. `BeautifulSoup` makes selecting HTML elements easier.
`requests` downloads the files. `os` checks for, and if necessary creates, the directory the files are saved in.

In [1]:
#GET
import requests

#Render Website (Javascript) using webdriver
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')  # run Chrome in headless mode
chrome_options.add_argument('--disable-gpu')  # work around a Google Chrome bug

#Element Selector
from bs4 import BeautifulSoup as bs
from urllib import parse

import os

## Configurations

This cell sets the configuration: the file extensions that count as handouts (`available_file_extention`) and the directory the files are saved in (`path`).

In [2]:
# configuration
available_file_extention = [
    'pdf',
    'doc',
    'docx',
    'ppt',
    'pptx',
    'xls',
    'xlsx'
]

# path = '/Users/abc/Downloads/'
path = './'  # current directory

print(f"Note: files will be saved in {path}")

Note: files will be saved in ./


In [3]:
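def get_soup_by_url(url):
    """Return the soup (parsed HTML object) for the given URL after its front-end code has run."""
    # The driver path is passed positionally (Selenium 3 style);
    # newer Selenium releases expect a Service object instead.
    driver = webdriver.Chrome("./chromedriver", options=chrome_options)
    try:
        driver.get(url)
        page_source = driver.page_source
    finally:
        driver.quit()  # close the window and stop the chromedriver process

    return bs(page_source, "html.parser")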

## User Input
The user is asked for the URL of the course. The course name, the course code, and the ids of the lectures are then scraped from the course page.

In [4]:
url = input("Enter course url:")
# url = 'http://ocw.aca.ntu.edu.tw/ntu-ocw/ocw/cou/106S201'

soup_course_site = get_soup_by_url(url)
# Get ocw code
url = soup_course_site.find("meta",  property="og:url")['content'] # to make sure the url doesn't include topic numbers
course_name = soup_course_site.select("h2.title")[0].get_text()
ocw_code = url.split('/')[-1]

#Get total number of lectures
lecture_list = soup_course_site.select('div.AccordionPanel')
lecture_num = len(lecture_list)

print('Course name\t', course_name)

# Get the list of lecture ids.
# Each section's id has the form 'topXX', so drop the leading 'top' to get the id.
ids = [l.get('id')[3:] for l in lecture_list]

Enter course url:http://ocw.aca.ntu.edu.tw/ntu-ocw/ocw/cou/104S204
Course name	 數位語音處理概論
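
The two string operations in this cell can be checked without launching a browser. The following is a minimal sketch, assuming that panel ids always follow the `top<number>` pattern described in the comment above. The helper names `extract_ocw_code` and `strip_top_prefix` are hypothetical and are not part of the notebook.

```python
from urllib.parse import urlparse


def extract_ocw_code(course_url):
    """Return the last path segment of a course URL, e.g. '104S204'."""
    path = urlparse(course_url).path.rstrip("/")
    return path.split("/")[-1]


def strip_top_prefix(panel_id):
    """Turn an accordion panel id such as 'top12' into '12'."""
    prefix = "top"
    if not panel_id.startswith(prefix):
        # Assumption: every panel id starts with 'top'; fail loudly otherwise.
        raise ValueError(f"unexpected panel id: {panel_id!r}")
    return panel_id[len(prefix):]


ocw_code = extract_ocw_code("http://ocw.aca.ntu.edu.tw/ntu-ocw/ocw/cou/104S204")
lecture_ids = [strip_top_prefix(p) for p in ["top1", "top2", "top10"]]
print(ocw_code, lecture_ids)  # 104S204 ['1', '2', '10']
```

Raising an error on an unexpected id, rather than slicing blindly with `[3:]`, makes a change in the site's markup visible immediately.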


## File URL searching
In this section, the program visits each lecture's page and searches it for file links. To avoid loading and rendering unnecessary elements, it requests the AJAX endpoint that serves each lecture's content directly instead of the full course page.

All lecture-note files are linked from `a` tags inside the `div` with class `classnote`. Some of these `a` tags link to videos rather than files. To keep only file links, split the URL on "." and check whether the last element, i.e. the file extension, matches one of those listed in the configuration section.
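
Splitting on "." can misfire when a URL carries a query string or when the extension is upper-case. The sketch below shows one stricter way to run the same check by parsing the URL first. It is a variant for illustration, not the notebook's own code: the helper names `get_extension` and `is_handout` and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse

# Same extensions as in the configuration cell.
available_file_extention = ['pdf', 'doc', 'docx', 'ppt', 'pptx', 'xls', 'xlsx']


def get_extension(url):
    """Return the lower-cased extension of the URL path, without the dot."""
    path = urlparse(url).path  # drops any ?query or #fragment
    return os.path.splitext(path)[1].lstrip('.').lower()


def is_handout(url):
    """True if the URL path ends in one of the configured extensions."""
    return get_extension(url) in available_file_extention


print(is_handout("http://example.com/notes/lecture1.PDF?download=1"))  # True
print(is_handout("https://example.com/watch?v=abc.pdf"))               # False
```

In the second URL the ".pdf" is part of the query string, so a plain `url.split('.')[-1]` would wrongly treat the link as a file.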

Because a lecture may have more than one file, and to make naming more intuitive in the later section, the URLs are saved in a dictionary whose keys are lecture ids and whose values are lists of all the URLs bound to that lecture.
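
The dictionary-of-lists structure can be sketched on its own as follows. This is a minimal illustration that uses `dict.setdefault` in place of an explicit if-else. The function name `group_by_lecture` and the example URLs are hypothetical.

```python
def group_by_lecture(pairs):
    """Group (lecture_id, url) pairs into {lecture_id: [url, ...]}."""
    grouped = {}
    for lecture_id, url in pairs:
        # setdefault creates the empty list the first time an id is seen
        grouped.setdefault(lecture_id, []).append(url)
    return grouped


found = [
    ("1", "http://example.com/a.pptx"),
    ("1", "http://example.com/a.pdf"),
    ("2", "http://example.com/b.pptx"),
]
to_download_url = group_by_lecture(found)
file_count = sum(len(urls) for urls in to_download_url.values())
print(to_download_url["1"])  # ['http://example.com/a.pptx', 'http://example.com/a.pdf']
print(file_count)            # 3
```

Since dictionaries preserve insertion order, iterating over the result later visits the lectures in the order they were found.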

In [5]:
print("Searching for files...")
to_download_url = dict()
file_count = 0
for i in ids:
    # request the AJAX endpoint directly so that as little as possible is rendered
    ajax_url = f"http://ocw.aca.ntu.edu.tw/ntu-ocw/cou-ajax/topic-content/{ocw_code}/{i}"
    soup_lecture = get_soup_by_url(ajax_url)
    classnote_url = soup_lecture.select(".classnote a")
    for u in classnote_url:
        u = u["href"]
        file_extention = u.split('.')[-1]
        if file_extention in available_file_extention:
            if i in to_download_url:
                to_download_url[i].append(u)
            else:
                to_download_url[i] = [u]
            file_count += 1
print(f"\n{file_count} files in total\n")

Searching for files...

21 files in total



In [6]:
# make the course directory

path = os.path.join(path, course_name)

os.makedirs(path, exist_ok=True)
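
A course name scraped from a web page may contain characters that are not valid in a directory name. The sketch below shows one way to guard against that before creating the directory. The helper `make_course_dir`, the replacement rule, and the example course name are assumptions made for illustration and are not part of the notebook.

```python
import os
import re
import tempfile


def make_course_dir(base_path, course_name):
    """Create base_path/<course_name> if needed and return the full path."""
    # Hypothetical clean-up: replace characters that are invalid in file names.
    safe_name = re.sub(r'[\\/:*?"<>|]', "_", course_name).strip()
    course_path = os.path.join(base_path, safe_name)
    os.makedirs(course_path, exist_ok=True)  # no error if it already exists
    return course_path


with tempfile.TemporaryDirectory() as base:
    first = make_course_dir(base, "Intro to DSP: Part 1/2")
    second = make_course_dir(base, "Intro to DSP: Part 1/2")  # safe to repeat
    print(os.path.basename(first))  # Intro to DSP_ Part 1_2
    print(first == second, os.path.isdir(first))  # True True
```

Using `os.path.join` also means the base path does not need to end with a slash.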

## Download all urls

Once all the URLs for the course have been collected, the files are downloaded. An if-else statement keeps the file naming systematic.
By default the file name is `lecture_<id>.<ext>`, where `<id>` is the lecture id. When a lecture has two or more files, the name becomes `lecture_<id>_<n>.<ext>`, where `<n>` is the 1-based position of the file within the lecture.
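
The naming rule can be tried out separately from the download itself. The following is a minimal sketch that only computes the target file names, assuming the same dictionary layout as `to_download_url`. The function name `plan_file_names` and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse


def plan_file_names(to_download_url):
    """Return a list of (url, file_name) pairs following the naming rule."""
    plan = []
    for lecture_id, urls in to_download_url.items():
        base = f"lecture_{lecture_id}"
        for index, url in enumerate(urls, start=1):
            # splitext keeps the leading dot, e.g. '.pptx'
            extension = os.path.splitext(urlparse(url).path)[1]
            # only number the files when the lecture has more than one
            suffix = f"_{index}" if len(urls) > 1 else ""
            plan.append((url, f"{base}{suffix}{extension}"))
    return plan


example = {
    "1": ["http://example.com/a.pptx", "http://example.com/a.pdf"],
    "2": ["http://example.com/b.pptx"],
}
for _, name in plan_file_names(example):
    print(name)
# lecture_1_1.pptx
# lecture_1_2.pdf
# lecture_2.pptx
```

Computing the names first makes it possible to review them, or to skip files that already exist, before any request is sent.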

In [8]:
# download and rename

for key, value in to_download_url.items():
    file_name = 'lecture_' + str(key)
    if len(value) > 1:
        for i, url in enumerate(value):
            file_name_i = file_name + "_" + str(i + 1)
            lecture_file = requests.get(url)
            file_extention = url.split('.')[-1]

            # the with block closes the file as soon as it has been written
            with open(f'{path}/{file_name_i}.{file_extention}', 'wb') as f:
                f.write(lecture_file.content)
            print(f"Downloaded\t{file_name_i}.{file_extention}")

    else:
        url = value[0]
        file_extention = url.split('.')[-1]
        lecture_file = requests.get(url)
        with open(f'{path}/{file_name}.{file_extention}', 'wb') as f:
            f.write(lecture_file.content)

        print(f"Downloaded\t{file_name}.{file_extention}")

Downloaded	lecture_1_1.pptx
Downloaded	lecture_1_2.pdf
Downloaded	lecture_2.pptx
Downloaded	lecture_3.pptx
Downloaded	lecture_4.pptx
Downloaded	lecture_5.pptx
Downloaded	lecture_6.pptx
Downloaded	lecture_7.pptx
Downloaded	lecture_8.pptx
Downloaded	lecture_9.pptx
Downloaded	lecture_10.pptx
Downloaded	lecture_11.pptx
Downloaded	lecture_12_1.pptx
Downloaded	lecture_12_2.pptx
Downloaded	lecture_13.pptx
Downloaded	lecture_14.pptx
Downloaded	lecture_15.pptx
Downloaded	lecture_16.pptx
Downloaded	lecture_17.pptx
Downloaded	lecture_18.pptx
Downloaded	lecture_19.pptx
