*This scraping exercise is done purely for educational purposes.*
*All videos belong to Harvard University*

# Code to scrape CS109 2015 videos.

## History and Motivation

I'm rather unhappy with this implementation. I want to download these videos as I'll be starting work soon, and I'm not going to have the luxury of time to keep up with the course. I think I might take a few months to work through the course, and I'm worried they are going to take down the videos. I see hints that they're going to be starting a 2017 run soon. My fear is they are going to stop hosting the vids when that course starts, just as the 2013 and 2014 vids are now gone. So this is more for my own insurance and a reluctance for my learning to be beholden to anyone else's schedule.

I thought it would be good practice as well, to try and scrape a website.

The website design is simple enough.

A [main page](https://matterhorn.dce.harvard.edu/engage/ui/index.html#/2016/01/14328) has links to [viewer](https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=b721f1db-9c16-464b-bcd3-3d81e7cd13a1) pages, each of which embeds either [one video](https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=62b95e14-c296-44da-9691-446dfa313836) of the instructor or [two videos](https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=e15f221c-5275-4f7f-b486-759a7d483bc8), one of the instructor and one of the slides.

However, a major stumbling block is that the pages are rendered with JavaScript, so a plain HTTP fetch returns only the unrendered template. I briefly tried scraping the rendered page, first with code from [I'm Pythonist](https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/), then with a very similar suggestion from [Stack Overflow](https://stackoverflow.com/questions/35241872/how-to-download-from-javascript-rendered-webpage), which seemed to hang. There is also [this](https://github.com/christopher-beckham/cs109-dl-videos) GitHub script based on a [Quora](http://www.quora.com/Downloads/How-can-I-download-the-videos-for-CS109-Harvards-Data-Science-Course) answer, but it appears to have been written for the 2013/14 videos and is hardcoded to those (outdated) links.

Seeing that the best available solution was hardcoded anyway, I decided to bite the bullet and save the HTML of each page manually. This took most of my time, not helped by having only 2GB of RAM. It is surely possible to get Python to load the page while posing as a browser, probably through a setting buried somewhere in PyQt. Still, I got this program working in one day, which I doubt I could have done had I tried to figure that out.

## How it works

Recall: 

A [main page](https://matterhorn.dce.harvard.edu/engage/ui/index.html#/2016/01/14328) has links to [viewer](https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=b721f1db-9c16-464b-bcd3-3d81e7cd13a1) pages, each of which embeds either [one video](https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=62b95e14-c296-44da-9691-446dfa313836) of the instructor or [two videos](https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=e15f221c-5275-4f7f-b486-759a7d483bc8), one of the instructor and one of the slides.

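1. The HTML of the **main** page is saved by hand in "h1_main.txt".
2. All **viewer** links are extracted as a list of tuples *(name, date, link)* and written to "list_add.txt".
3. The list is edited manually to delete unwanted links and fix names, then saved as "list_add(selected).txt".
4. Every **viewer** page is visited and its HTML saved by hand as "Vid HTMLS/xx.html".
5. The video link(s) are extracted from each saved page and stored in a dict keyed by the earlier tuple (extended with the save filename).
6. Finally, the videos are downloaded.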

Modify *prefix* at point indicated to change save location.

In [1]:
import bs4
import urllib

In [2]:
with open("h1_main.txt") as h1:
    h1_main = h1.read()
# name the parser explicitly so BeautifulSoup does not have to guess one
har_main = bs4.BeautifulSoup(h1_main, "lxml")
print har_main.prettify()

# get titles
titles = [l.get('title') for l in har_main.findAll('div', {'class':"publication-title auto-launch"})]
title_list = []
for i in titles:
    title_list.append(i[17:])

# get links
link_list = [l.get('href') for l in har_main.findAll('a', {'class':"item-link", 'data-is-default-link':"true"})]

# get dates
date_list = [l.get('data-date-time') for l in har_main.findAll('div', {'class':"pub-date ng-binding"})]





<html>
 <body>
  <!-- Distance Education Header and banner  -->
  <div class="banner extension" data-ng-class="{'embedded':embedded || ltitool}" id="header-banner">
   <!-- Generalized Harvard (Summer/Extension agnostic) Header -->
   <div class="topbar">
    <div id="topbar-flex">
     <a href="http://www.harvard.edu/" title="The Harvard University web page">
      <span class="topbar__logo i-harvard-logo ir">
       Harvard University
      </span>
     </a>
     <div id="report-a-problem">
      <a href="//cm.dce.harvard.edu/forms/report.shtml?server=MH&amp;offeringId=20160114328" title="Link to Report A Problem Form">
       Report A
            Problem
      </a>
     </div>
    </div>
   </div>
  </div>
  <!-- ngView: ng-view -->
  <div class="ng-scope" data-ng-view="ng-view" id="listing-body">
   <div class="container container-fluid pubData ng-scope" data-partial-name="publist">
    <!-- title section -->
    <div id="title-section">
     <!-- ngIf: pubData.series.count == 1 --

In [3]:
# to remove file if it exists
import os
try:
    os.remove('list_add.txt')
except OSError:
    # nothing to remove on the first run
    pass

# write to file, manually remove unwanteds
with open('list_add.txt', 'a') as f:
    for line in zip(title_list, date_list, link_list):
        f.write(str(line))
        f.write('\n')

In [4]:
# import in the files you want to save.
with open('list_add(selected).txt', 'r') as f:
    wanted_list = []
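    for line in f:
        if line[0] != '(':
            continue
        # strip the leading "('" and trailing "')\n", then split the three fields
        c = line[2:-3].split("', '")
        # catch malformed lines and skip them instead of crashing on c[2] below
        if len(c) != 3:
            print 'ERROR!', line
            continue
        wanted_list.append((c[0], c[1], c[2]))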

index = 1
for n,d,w in wanted_list:
    print index, w
    index +=1
    
# manually go to the sites and save the html

1 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=b721f1db-9c16-464b-bcd3-3d81e7cd13a1
2 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=32a2d292-50eb-44b0-9f28-a728e515d612
3 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=128b8123-a1a6-493c-bac7-a932234374a0
4 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=12bfea44-634f-4bc0-b88d-0aca05a3c289
5 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=94e52a8d-6557-48c4-b003-b5ec84d2a1e2
6 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=34758fd3-9896-4461-966a-7971e349fee3
7 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=b1d70f08-4c37-4ca7-9fd1-769f4a5adbd2
8 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=03277e71-f8f1-443b-b13a-7e54f762b287
9 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=5ca8d569-0c51-47aa-83df-147cc4b97e57
10 https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=3c49a6e1-b9e6-47
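
The slicing in the cell above depends on every line ending in exactly `')` plus a newline, and on no name containing `', '`. A minimal alternative sketch, assuming the lines were written with `str(tuple)` as in the earlier cell, is to let `ast.literal_eval` rebuild the tuples. The function name `parse_wanted_lines` and the sample lines are made up for illustration, and the snippet is written for Python 3 rather than the Python 2 used in the notebook.

```python
import ast

def parse_wanted_lines(lines):
    """Parse lines like "('name', 'date', 'link')" into 3-tuples.

    Lines that do not start with '(' are ignored. Malformed lines are
    collected in a second list instead of crashing the loop.
    """
    wanted, bad = [], []
    for line in lines:
        line = line.strip()
        if not line.startswith('('):
            continue
        try:
            entry = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            bad.append(line)
            continue
        if isinstance(entry, tuple) and len(entry) == 3:
            wanted.append(entry)
        else:
            bad.append(line)
    return wanted, bad

# Hypothetical sample input, not taken from the real list_add(selected).txt
sample = [
    "some note to self, ignored\n",
    "('Lecture 1 (CAPTIONS)', '2 Sep 1:30AM', 'https://example.org/watch.html?id=abc')\n",
    "('broken line', 'only two fields')\n",
]
wanted, bad = parse_wanted_lines(sample)
print(wanted)
print(bad)
```

This also copes with names that contain an apostrophe, since `str()` switches to double quotes for those and the fixed `"', '"` separator would no longer match.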

> Note: Set **prefix** to the directory where you want the videos saved!

In [5]:
# make filename, add to tuple in wanted list
convert_mth = {'Dec':'12', 'Nov':'11', 'Oct':'10', 'Sep':'09'}
final_names = []
prefix = '../../../../../../BUFFALO/Harvard/vids/'
for n,d,w in wanted_list:
    d1 = d.split()
    final_names.append((n,d,w, 
                        prefix + convert_mth[d1[1]] + d1[0] + ' - ' + n.split()[0] + ' ' + n.split()[1] + '.mp4'))
final_names[0]

('Final Project Presentations (CAPTIONS)',
 '16 Dec 2:30AM',
 'https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=b721f1db-9c16-464b-bcd3-3d81e7cd13a1',
 '../../../../../../BUFFALO/Harvard/vids/1216 - Final Project.mp4')
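
The filename rule above (month number, day, then the first two words of the title) can be pulled into a small helper. This is only a sketch in Python 3: `make_filename` and `MONTHS` are illustrative names, and zero-padding the day is an addition that the original cell does not do.

```python
import os.path

# Only the months the course ran in, as in convert_mth above
MONTHS = {'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def make_filename(name, date, prefix):
    """Build '<prefix>/<MMDD> - <first two words of name>.mp4'."""
    day, month = date.split()[:2]
    first_two_words = ' '.join(name.split()[:2])
    # Assumption: pad the day to two digits so the names sort by date
    stem = '%s%02d - %s' % (MONTHS[month], int(day), first_two_words)
    return os.path.join(prefix, stem + '.mp4')

example = make_filename('Final Project Presentations (CAPTIONS)',
                        '16 Dec 2:30AM', 'vids')
print(example)
```

Using `os.path.join` means the prefix works with or without a trailing slash.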

In [6]:
# make a dict to store webbys
webby_dict = {}

for i in xrange(39):
    # saved pages are numbered 01.html, 02.html, ...
    file_name = "Vid HTMLS/%02d.html" % (i + 1)

    # open file
    content, vid_links = None, []
    with open(file_name, 'r') as f:
        content = f.read()
    content = bs4.BeautifulSoup(content, "lxml")

    # get links as a list
    vid_links = [l.get('src') for l in content.findAll('source', {'type':"video/mp4"})]
    webby_dict[final_names[i]] = vid_links
    
webby_dict.items()[0]

# format of dict : (name, date, web_viewer website, name of save file) = [main_video, slides_video]

(('Lecture 17 Advanced Bayesian Thinking. (CAPTIONS)',
  '30 Oct 1:30AM',
  'https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=35282cbd-94b3-4fd7-bd5e-b3a8f40e72b1',
  '../../../../../../BUFFALO/Harvard/vids/1030 - Lecture 17.mp4'),
 ['https://da4w749qm6awt.cloudfront.net/engage-player/35282cbd-94b3-4fd7-bd5e-b3a8f40e72b1/25d2319e-a727-46ea-a6eb-97dc6309ca71/presenter_delivery.mp4',
  'https://da4w749qm6awt.cloudfront.net/engage-player/35282cbd-94b3-4fd7-bd5e-b3a8f40e72b1/1a2f075f-d54b-4679-a985-ea774816a660/presentation_delivery.mp4'])
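
The extraction step only needs the `src` of every `<source type="video/mp4">` tag in a saved viewer page. As a rough sketch of the same idea without BeautifulSoup, the standard library's `html.parser` is enough. Note that this swaps in a different parser from the one the notebook uses; the class and function names and the sample HTML are invented for illustration, and the snippet targets Python 3.

```python
from html.parser import HTMLParser

class Mp4SourceParser(HTMLParser):
    """Collect the src of every <source type="video/mp4"> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'source':
            return
        attrs = dict(attrs)
        if attrs.get('type') == 'video/mp4' and attrs.get('src'):
            self.links.append(attrs['src'])

def extract_mp4_links(html):
    parser = Mp4SourceParser()
    parser.feed(html)
    return parser.links

# Hypothetical stand-in for a saved viewer page
sample_html = """
<video><source src="https://example.org/presenter_delivery.mp4" type="video/mp4"/></video>
<video><source src="https://example.org/presentation_delivery.mp4" type="video/mp4"></video>
<video><source src="https://example.org/clip.webm" type="video/webm"></video>
"""
links = extract_mp4_links(sample_html)
print(links)
```

A page with only the instructor video yields a one-element list, which is the case the download step has to handle.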

In [10]:
webby_keys = webby_dict.keys()
# sorts by date 
webby_keys = sorted(webby_keys, key=lambda x : convert_mth[x[1].split()[1]] + x[1].split()[0])
webby_keys[38]

('Final Project Presentations (CAPTIONS)',
 '16 Dec 2:30AM',
 'https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=b721f1db-9c16-464b-bcd3-3d81e7cd13a1',
 '../../../../../../BUFFALO/Harvard/vids/1216 - Final Project.mp4')
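
One caveat with the sort key above: it concatenates month and day as strings, so if a day is ever written without a leading zero (an assumption, since only two-digit days appear in the output shown here), "2 Oct" becomes `'102'` and sorts after "16 Oct" (`'1016'`). A small sketch of a numeric key that avoids this, with `date_key` and `MONTH_NUM` as illustrative names and made-up entries, in Python 3:

```python
MONTH_NUM = {'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def date_key(entry):
    """Sort key for tuples whose second field looks like '16 Dec 2:30AM'."""
    day, month = entry[1].split()[:2]
    return (MONTH_NUM[month], int(day))

# Hypothetical entries, deliberately out of order
entries = [
    ('Lecture B', '16 Oct 1:30AM'),
    ('Lecture A', '2 Oct 1:30AM'),
    ('Lecture C', '9 Sep 1:30AM'),
]
ordered = sorted(entries, key=date_key)
print([name for name, _ in ordered])
```

Comparing `(month, day)` as integers keeps the ordering correct whether or not the days are padded.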

In [8]:
# get a few files at a time (worried about storage capacity)
test = urllib.URLopener()

for n in xrange(39):
    dl_list = webby_dict[webby_keys[n]]
    dl_name = webby_keys[n][3]
    dl_name_slides = webby_keys[n][3][:-4] + ' (slides).mp4'
    # get the first vid
    print 'starting DL #%d ...' % n
    test.retrieve(dl_list[0], dl_name)
    # get the second vid (some lectures have no slides video)
    if len(dl_list) > 1:
        print 'starting slides DL #%d ...' % n
        test.retrieve(dl_list[1], dl_name_slides)
        print "completed!"
    else:
        print "no slides, completed!"

IndexError: list index out of range

> You have saved all vids till index 13!
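
Since the run stopped partway, it helps if the download step can be restarted without fetching everything again and without tripping over lectures that have one link or none. Here is a hedged sketch of that idea in Python 3: `download_lecture` is a hypothetical helper, `urlretrieve` from `urllib.request` stands in for the Python 2 `URLopener().retrieve`, and the dry run uses a fake retriever so nothing is actually downloaded.

```python
import os
from urllib.request import urlretrieve

def download_lecture(links, dest, retrieve=urlretrieve, exists=os.path.exists):
    """Fetch the presenter video and, if there is one, the slides video.

    Files that already exist are skipped, so an interrupted run can be
    restarted. Returns the paths written during this call.
    """
    if not links:
        return []  # the saved page had no mp4 sources at all
    targets = [(links[0], dest)]
    if len(links) > 1:
        targets.append((links[1], dest[:-4] + ' (slides).mp4'))
    written = []
    for url, path in targets:
        if exists(path):
            continue
        retrieve(url, path)
        written.append(path)
    return written

# Dry run with a stand-in retriever and made-up URLs
calls = []
written = download_lecture(
    ['https://example.org/presenter.mp4', 'https://example.org/slides.mp4'],
    'vids/1216 - Final Project.mp4',
    retrieve=lambda url, path: calls.append((url, path)),
    exists=lambda path: False,
)
print(written)
```

One limitation of the existence check: a file left half-written by an interrupted download would be skipped as well, so such files need to be deleted by hand before rerunning.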

## Scraping Basics - the Lecture 2 way

In [1]:
## all imports
from IPython.display import HTML
import numpy as np
import urllib2
import bs4 #this is beautiful soup
import time
import operator
import socket
import cPickle
import re # regular expressions

from pandas import Series
import pandas as pd
from pandas import DataFrame

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_context("talk")
sns.set_style("white")

In [2]:
url = "https://matterhorn.dce.harvard.edu/engage/ui/index.html#/2016/01/14328"
source = urllib2.urlopen(url).read()
# name the parser explicitly so BeautifulSoup does not have to guess one
har_main = bs4.BeautifulSoup(source, "lxml")
print har_main.prettify()

<!DOCTYPE html>
<html data-ng-app="pubListApp" lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="HUDCE Publication Listing" name="description"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Publication Listing
  </title>
  <link href="combined/css/ocDcePubListing.css" rel="stylesheet"/>
 </head>
 <body>
  <!-- Distance Education Header and banner  -->
  <div class="banner {{school}}" data-ng-class="{'embedded':embedded || ltitool}" id="header-banner">
   <!-- Generalized Harvard (Summer/Extension agnostic) Header -->
   <div class="topbar">
    <div id="topbar-flex">
     <a href="http://www.harvard.edu/" title="The Harvard University web page">
      <span class="topbar__logo i-harvard-logo ir">
       Harvard University
      </span>
     </a>
     <div id="report-a-problem">
      <a href="//cm.dce.harvard.edu/forms/report





Page is implemented with Javascript! What to do?
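
A quick way to confirm this from the fetched source is to look for template placeholders that the browser would normally fill in, such as the `{{school}}` in the output above. The check below is only a heuristic sketch in Python 3; `looks_unrendered` and the regular expression are illustrative choices, not part of any library.

```python
import re

# Matches simple {{ name }} or {{ a.b }} placeholders left in the markup
TEMPLATE_MARKER = re.compile(r'\{\{\s*[\w.]+\s*\}\}')

def looks_unrendered(html):
    """Heuristic: True if the HTML still contains {{ ... }} placeholders."""
    return bool(TEMPLATE_MARKER.search(html))

# Shortened versions of the two banner lines seen in the outputs above
raw = '<div class="banner {{school}}" id="header-banner">'
rendered = '<div class="banner extension" id="header-banner">'
print(looks_unrendered(raw), looks_unrendered(rendered))
```

The copy saved by hand from the browser has the placeholder replaced by `extension`, while the `urllib2` fetch still carries the raw template.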

I turned to [this guy](https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/) for some help. Here's his method:

## Scraping Javascript pages

1. Install PyQt4

>  `sudo apt-get install python-qt4`

Unfortunately, the instructions use PyQt4, while this build has PyQt5. Can the instructions be adapted?

In [4]:
import sys

# QtCore contains the core classes, including the event loop and Qt's signal and slot mechanism.
# It also includes platform-independent abstractions for animations, state machines, threads,
# mapped files, shared memory, regular expressions, and user and application settings.
from PyQt5.QtCore import *
# QtGui contains classes for windowing system integration, event handling, 2D graphics, basic
# imaging, fonts and text. It also contains a complete set of OpenGL and OpenGL ES bindings.
from PyQt5.QtGui import *
# QtWebKit contains classes for a WebKit2 based implementation of a web browser.
from PyQt5.QtWebKit import *
from PyQt5.QtWebKitWidgets import *
from PyQt5.QtWidgets import *

In [5]:
# impythonista

class Render(QWebPage):  
    def __init__(self, url):  
        self.app = QApplication(sys.argv)  
        QWebPage.__init__(self)  
        self.loadFinished.connect(self._loadFinished)  
        self.mainFrame().load(QUrl(url))  
        self.app.exec_()  

    def _loadFinished(self, result):  
        self.frame = self.mainFrame()  
        self.app.quit()

In [7]:
# IMPYTHONISTA

url = "https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=b721f1db-9c16-464b-bcd3-3d81e7cd13a1"
r1 = Render(url)
# with PyQt5, toHtml() returns a Python unicode string (PyQt4 would return a QString)
result = r1.frame.toHtml()
print type(result)

<type 'unicode'>


In [8]:
import bs4
har_main = bs4.BeautifulSoup(result, "lxml")
print har_main.prettify()

<!DOCTYPE html>
<html>
 <head>
  <meta content="text/html; charset=utf-8;" http-equiv="Content-type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="initial-scale=1.0" name="viewport"/>
  <title>
   DCE Video Player
  </title>
  <script src="javascript/swfobject.js" type="text/javascript">
  </script>
  <script src="javascript/base.js" type="text/javascript">
  </script>
  <script src="javascript/jquery.js" type="text/javascript">
  </script>
  <script src="javascript/lunr.min.js" type="text/javascript">
  </script>
  <!-- DCE additional js start -->
  <script src="mh_dce_resources/js/jquery-ui.min.js" type="text/javascript">
  </script>
  <script src="resources/bootstrap/js/bootstrap.min.js" type="text/javascript">
  </script>
  <!-- DCE additional js end -->
  <!-- DCE adding require back in, UPV uses it to pull in dashjs -->
  <script src="javascript/require.js" type="text/javascript">
  </script>
  <script src="javascript/paella_player.js" type="text/jav

Problem: the site reports that the browser is outdated.

Attempting the [Stack Overflow](https://stackoverflow.com/questions/35241872/how-to-download-from-javascript-rendered-webpage) method:

In [None]:
import sys

from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication

class Render(QWebPage):
    """Render HTML with PyQt5 WebKit."""

    def __init__(self, html):
        self.html = None
        print 1
        self.app = QApplication(sys.argv)
        print 2
        QWebPage.__init__(self)
        print 3
        self.loadFinished.connect(self._loadFinished)
        print 4
        self.mainFrame().setHtml(html)
        print 5
        self.app.exec_() # Spending a long time here
        print 6

    def _loadFinished(self, result):
        print 7
        self.html = self.mainFrame().toHtml()
        print 8
        self.app.quit()
        print 9

url = "https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=b721f1db-9c16-464b-bcd3-3d81e7cd13a1"
print 'beginning render'
r1 = Render(url).html
print 'rendered, beg b4'
import bs4
har_main = bs4.BeautifulSoup(r1)
print har_main.prettify()

beginning render
1
2
3
4
7
8
9
5


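Seems to get stuck loading. The printed order hints at why: 7, 8 and 9 appear before 5, so `loadFinished` most likely fired while `setHtml` was still running, and `app.quit()` was called before `app.exec_()` had started the event loop, leaving nothing to stop it afterwards. Note also that `Render` is handed the URL string itself as the HTML to render, not the content of the page.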

Okay. What if we didn't scrape the main page, since it's a one-off task, and just copied in the HTML?