ME204 - Data Engineering for the Social World

# **Notebook 1 - Data Collection**

Goals:
 - Use Selenium and the Chrome webdriver to scrape edX's website for course info.
 - Implement `try`/`except` blocks to handle courses with differently structured pages, ensuring that the correct data is collected.
 - Save all scraped data into a raw CSV and set a foundation for data cleaning.

Installs:

In [3]:
# !pip install requests scrapy pandas selenium 

Importing necessary packages:


In [4]:
# For sending HTTP requests (may not be needed with Selenium, but imported just in case)
import requests
# For disabling UserWarning
import warnings
warnings.filterwarnings('ignore', category=UserWarning)
# For parsing HTML content
from scrapy.selector import Selector
# To build a DataFrame and write it to a CSV, which is cleaned in another notebook
import pandas as pd
# For working with file paths and directories
import os

Importing the Selenium modules, constructing the driver, and running a test that pulls the cookies to confirm the site can be interacted with:

In [5]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import Select
driver = webdriver.Chrome()
driver.get("https://www.edx.org/search?tab=course")
driver.get_screenshot_as_png()
driver.implicitly_wait(10)
# get_cookies is a method, so it must be called to return the list of cookies
all_cookies = driver.get_cookies()
print({"All cookies": all_cookies})
driver.quit()


This is a demo to grab the course names, providers, and links of the first page's courses:

In [6]:
driver = webdriver.Chrome()
driver.get("https://www.edx.org/search?tab=course")
wait = WebDriverWait(driver, 10) 

courses = []
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/div[contains(@class, "pgn__card-header-title-md")]/span')))
courses_element = driver.find_elements(By.XPATH, './/div[contains(@class, "pgn__card-header-title-md")]/span')
for element in courses_element:
    courses.append(element.text)
courses

providers = []
# Wait for the provider elements themselves, not the course titles
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/div[contains(@class, "pgn__card-header-subtitle-md")]/span')))
provider_element = driver.find_elements(By.XPATH, './/div[contains(@class, "pgn__card-header-subtitle-md")]/span')
for element in provider_element:
    providers.append(element.text)
providers

# Wait for the link elements before reading their href attributes
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/a[contains(@class, "base-card-link")]')))
elements = driver.find_elements(By.XPATH, './/a[contains(@class, "base-card-link")]')
links = [element.get_attribute("href") for element in elements]
links

driver.quit()
print(courses[0:5])
print(providers[0:5])
print(links[0:5])


['How to Learn Online', 'The Science of\nHappiness', 'Remote Work\nRevolution for\nEveryone', "CS50's Introduction to\nComputer Science", 'Data Visualization and\nBuilding Dashboards\nwith Excel and Cogn…']
['edX', 'University of California, Ber…', 'Harvard University', 'Harvard University', 'IBM']
['https://www.edx.org/learn/how-to-learn/edx-how-to-learn-online?index=product&queryID=2580824d2b652cfb688b7230e08a2cc1&position=1&results_level=second-level-results&term=&objectID=course-0e575a39-da1e-4e33-bb3b-e96cc6ffc58e&campaign=How+to+Learn+Online&source=edX&product_category=course&placement_url=https%3A%2F%2Fwww.edx.org%2Fsearch', 'https://www.edx.org/learn/happiness/university-of-california-berkeley-the-science-of-happiness?index=product&queryID=2580824d2b652cfb688b7230e08a2cc1&position=2&results_level=second-level-results&term=&objectID=course-73484215-4007-48cd-ba90-c945cde6030d&campaign=The+Science+of+Happiness&source=edX&product_category=course&placement_url=https%3A%2F%2Fwww.edx
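
The output above shows two things that will matter for cleaning: titles contain the line breaks of the card layout, and every link carries search-tracking query parameters (`queryID`, `position`, and so on). Below is a minimal sketch of how both could be normalized. It assumes that the path alone identifies a course page, which has not been verified against edX; `strip_query` and `flatten_title` are hypothetical helper names, and the shortened example link is for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def flatten_title(title: str) -> str:
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join(title.split())


# Shortened example link (assumption: the real links follow the same pattern)
example_link = (
    "https://www.edx.org/learn/how-to-learn/edx-how-to-learn-online"
    "?index=product&position=1&source=edX"
)

print(strip_query(example_link))
print(flatten_title("The Science of\nHappiness"))
```

Titles that the card truncated with `…` cannot be recovered this way; they would have to be read from the course page itself.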

Now, I'll define a function that pages through the search results and collects the course name, provider, and link for all 42 pages of courses on edX.

In [7]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#Lists to save the scraped data
courses = []
providers = []
links = []

# Defining the function with `pages` defaulting to 42; pass a larger value if edX adds more courses and thus more pages
def scrape_courses(pages=42):

    # Initialize the WebDriver (Make sure you have the correct driver executable for your browser)
    driver = webdriver.Chrome()

    # Base URL for edX course search with pagination
    base_url = 'https://www.edx.org/search?tab=course&page={}'

    # WebDriverWait instance
    wait = WebDriverWait(driver, 10)  # Wait up to 10 seconds for elements to be located

    # Loop through page 1 to page 'n'
    for page in range(1,pages+1):
        # Open the webpage
        driver.get(base_url.format(page))
    
        # Wait for course name elements to be present
        wait.until(EC.presence_of_all_elements_located((By.XPATH, './/div[contains(@class, "pgn__card-header-title-md")]/span')))
        courses_element = driver.find_elements(By.XPATH, './/div[contains(@class, "pgn__card-header-title-md")]/span')
        for element in courses_element:
            courses.append(element.text)
    
        # Wait for provider elements to be present
        wait.until(EC.presence_of_all_elements_located((By.XPATH, './/div[contains(@class, "pgn__card-header-subtitle-md")]/span')))
        provider_element = driver.find_elements(By.XPATH, './/div[contains(@class, "pgn__card-header-subtitle-md")]/span')
        for element in provider_element:
            providers.append(element.text)
    
        # Wait for link elements to be present
        wait.until(EC.presence_of_all_elements_located((By.XPATH, './/a[contains(@class, "base-card-link")]')))
        elements = driver.find_elements(By.XPATH, './/a[contains(@class, "base-card-link")]')
        for element in elements:
            links.append(element.get_attribute('href'))

    # Close the WebDriver
    driver.quit()

    # Return all three lists together (a function stops at its first return statement)
    return courses, providers, links

courses, providers, links = scrape_courses(42)
courses

['How to Learn Online',
 'The Science of\nHappiness',
 'Remote Work\nRevolution for\nEveryone',
 "CS50's Introduction to\nComputer Science",
 'Data Visualization and\nBuilding Dashboards\nwith Excel and Cogn…',
 'Six Sigma Part 2:\nAnalyze, Improve,\nControl',
 "CS50's Introduction to\nProgramming with\nPython",
 'Analyzing Data with\nExcel',
 'The Science of\nGenerosity: Do\nGood…Feel Good',
 'Entrepreneurial\nOperations:\nLaunching a Startup',
 'Marketing\nManagement',
 'The Foundations of\nHappiness at Work',
 'Introduction to Web\nAccessibility',
 'Coaching Skills for\nLearner-Centred\nConversations',
 'Creating Innovative\nBusiness Models',
 'Greatest Unsolved\nMysteries of the\nUniverse',
 'Interdisciplinary\nTeaching with\nMuseum Objects',
 'Intercultural\nCommunication at\nWork – Land the job…',
 'Data Science and\nAgile Systems for\nProduct Management',
 'Climate Change:\nFinancial Risks and\nOpportunities',
 'Strategies for Online\nTeaching and\nLearning',
 'Exercising\nLeade

**EFFICIENCY NOTE:** The runtime of this program was 4m 59.4s (299.4s). Writing down a sample of 10 course names, providers, and links by hand took 231.6s. Assuming that rate stays constant, collecting all 1000 courses manually would take an estimated 23160s, which is roughly 6.4 hours. The relative time saved by automating is (23160s - 299.4s)/(23160s) ≈ 0.987, so the scraper needs about 98.7% less time than manual collection (roughly 77 times faster).
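
The arithmetic behind this estimate can be reproduced in a few lines. This is only a sketch: it uses the two timings reported above and assumes manual collection scales linearly from the 10-course sample to all 1000 courses.

```python
automated_s = 299.4       # measured runtime of the scraper
manual_sample_s = 231.6   # time to write down 10 courses by hand
sample_size = 10
total_courses = 1000

# Linear extrapolation from the manual sample (an assumption)
manual_total_s = manual_sample_s / sample_size * total_courses

# Fraction of the manual time that automation saves
time_saved = (manual_total_s - automated_s) / manual_total_s

# How many times faster the scraper is
speedup = manual_total_s / automated_s

print(f"Manual estimate: {manual_total_s:.0f}s ({manual_total_s / 3600:.1f} hours)")
print(f"Time saved: {time_saved:.1%}")
print(f"Speed-up: {speedup:.0f}x")
```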

Checking if the length of the courses, providers, and links is correct to ensure every course was captured:

In [8]:

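if [len(courses), len(providers), len(links)] == [1000, 1000, 1000]:
    print("All courses were captured.")
else:
    # Show the actual counts so the list that fell short is easy to spot
    print(f"Some data is missing: {len(courses)} courses, {len(providers)} providers, {len(links)} links.")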

All courses were captured.
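
If the check ever fails, it helps to know which list fell short. Here is a small sketch of a reusable version, run on toy data rather than the scraped lists; `report_lengths` is a hypothetical helper name.

```python
def report_lengths(expected, **columns):
    """Map each column whose length differs from `expected` to its actual length."""
    return {
        name: len(values)
        for name, values in columns.items()
        if len(values) != expected
    }


# Toy data standing in for the scraped lists
mismatched = report_lengths(
    3,
    courses=["a", "b", "c"],
    providers=["x", "y"],
    links=["l1", "l2", "l3"],
)
print(mismatched or "All courses were captured.")
```

With the real data, the call would be `report_lengths(1000, courses=courses, providers=providers, links=links)`.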


Now, I have to access each course link and scrape the course length, time commitment, course start date, current enrollment, pace (self or instructor), price (of certificate) and subject data.

Here is a demo that scrapes all of the variables above from a single link to lay the groundwork. If this code scrapes correctly, the same approach can be applied to every course link.

In [9]:
driver = webdriver.Chrome()
driver.get("https://www.edx.org/learn/how-to-learn/edx-how-to-learn-online?index=product&queryID=d1085829cef82e17b14c6c8dcde7983e&position=1&results_level=second-level-results&term=&objectID=course-0e575a39-da1e-4e33-bb3b-e96cc6ffc58e&campaign=How+to+Learn+Online&source=edX&product_category=course&placement_url=https%3A%2F%2Fwww.edx.org%2Fsearch")

wait = WebDriverWait(driver, 10)
course_length = wait.until(EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'ml-3')]/div[contains(@class, 'h4 mb-0')]")))
print(course_length.text)

time_commitment = driver.find_element(By.XPATH, "//div[contains(@class, 'ml-3')]/div[contains(@class, 'small')]")
print(time_commitment.text)

course_start_date = driver.find_element(By.XPATH, "//div[@class='text-center mb-3.5']/div[@class='h4 mb-0']")
print(course_start_date.text)

current_enrollment = (driver.find_element(By.XPATH, "//div[@class='small']/div[@class='small']"))
print(int((current_enrollment.text.split(" ")[0].replace(",",""))))

pace = driver.find_element(By.XPATH, "//div[@class='d-flex align-items-start col-12 col-md-4']/div[@class='ml-3']/div[@class='h4 mb-0']")
print(pace.text)

price = driver.find_element(By.XPATH, "//tr[th[h4[text()='Price']]]/td[1]/p")
print(price.text)

subject = driver.find_element(By.XPATH, "//li[span[@class='font-weight-bold' and contains(text(), 'Subject:')]]/a")
print(subject.text)

# Optional: Close the driver after use
driver.quit()


2 weeks
2–3 hours per week
Starts Jul 27
287998
Self-paced
£54
Education & Teacher Training
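
The enrollment figure is the only value converted to a number at this stage, by taking the first token of the element text and removing the thousands separators. Below is a slightly more defensive sketch of the same idea. The sample string is an assumption about what the element text looks like, and `parse_enrollment` is a hypothetical helper name.

```python
import re


def parse_enrollment(text):
    """Return the leading integer in `text`, or None if there is none."""
    match = re.match(r"\s*([\d,]+)", text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    return int(digits) if digits else None


# Assumed element text; the real wording on edX may differ
print(parse_enrollment("287,998 already enrolled"))
print(parse_enrollment("Enrollment closed"))
```

Returning `None` for a missing value, rather than a number, keeps it from being mistaken for a course with no learners.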


Now, I'll visit every course in the `links` list to obtain the course-specific data, saving each variable into its own list. I'll create a separate function for each variable, shown below:

In [12]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Each fetcher returns a placeholder when the element is missing from the page.
# `except Exception` is used instead of a bare `except` so that KeyboardInterrupt
# can still stop a long-running scrape.

def fetch_course_length(driver, wait):
    try:
        return wait.until(EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'ml-3')]/div[contains(@class, 'h4 mb-0')]"))).text
    except Exception:
        return "NA"

def fetch_time_commitment(driver):
    try:
        return driver.find_element(By.XPATH, "//div[contains(@class, 'ml-3')]/div[contains(@class, 'small')]").text
    except Exception:
        return "NA"

def fetch_course_start_date(driver):
    try:
        return driver.find_element(By.XPATH, "//div[@class='text-center mb-3.5']/div[@class='h4 mb-0']").text
    except Exception:
        return "NA"

def fetch_current_enrollment(driver):
    try:
        enrollment_text = driver.find_element(By.XPATH, "//div[@class='small']/div[@class='small']").text
        return int(enrollment_text.split(" ")[0].replace(",", ""))
    except Exception:
        # Note: a missing enrollment figure is recorded as 0
        return 0

def fetch_pace(driver):
    try:
        return driver.find_element(By.XPATH, "//div[@class='d-flex align-items-start col-12 col-md-4']/div[@class='ml-3']/div[@class='h4 mb-0']").text
    except Exception:
        return "NA"

def fetch_price(driver):
    try:
        return driver.find_element(By.XPATH, "//tr[th[h4[text()='Price']]]/td[1]/p").text
    except Exception:
        # Note: a missing price is recorded as £0
        return "£0"

def fetch_subject(driver):
    try:
        return driver.find_element(By.XPATH, "//li[span[@class='font-weight-bold' and contains(text(), 'Subject:')]]/a").text
    except Exception:
        return "NA"

def scrape_course_data(links):
    course_length = []
    time_commitment = []
    course_start_date = []
    current_enrollment = []
    pace = []
    price = []
    subject = []

    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)

    for i in links:
        driver.get(i)
        
        try:
            course_length.append(fetch_course_length(driver, wait))
        except Exception as e:
            print(f"Course Length threw an error - Skipping page {i}: {e}")

        try:
            time_commitment.append(fetch_time_commitment(driver))
        except Exception as e:
            print(f"Time Commitment threw an error - Skipping page {i}: {e}")

        try:
            course_start_date.append(fetch_course_start_date(driver))
        except Exception as e:
            print(f"Course Start Date threw an error - Skipping page {i}: {e}")

        try:
            current_enrollment.append(fetch_current_enrollment(driver))
        except Exception as e:
            print(f"Current Enrollment threw an error - Skipping page {i}: {e}")

        try:
            pace.append(fetch_pace(driver))
        except Exception as e:
            print(f"Pace threw an error - Skipping page {i}: {e}")

        try:
            price.append(fetch_price(driver))
        except Exception as e:
            print(f"Price threw an error - Skipping page {i}: {e}")

        try:
            subject.append(fetch_subject(driver))
        except Exception as e:
            print(f"Subject threw an error - Skipping page {i}: {e}")

    # quit() ends the whole browser session; close() only closes the current window
    driver.quit()

    return {
        "course_length": course_length,
        "time_commitment": time_commitment,
        "course_start_date": course_start_date,
        "current_enrollment": current_enrollment,
        "pace": pace,
        "price": price,
        "subject": subject
    }

results = scrape_course_data(links)

print(results["course_length"])
print(results["time_commitment"])
print(results["course_start_date"])
print(results["current_enrollment"])
print(results["pace"])
print(results["price"])
print(results["subject"])


['2 weeks', '11 weeks', '3 weeks', '12 weeks', '4 weeks', '8 weeks', '10 weeks', '5 weeks', '4 weeks', '4 weeks', '9 weeks', '4 weeks', '4 weeks', '4 weeks', '4 weeks', '9 weeks', '14 weeks', '6 weeks', '4 weeks', '4 weeks', 'NA', '4 weeks', '3 weeks', '7 weeks', '1 weeks', '12 weeks', '8 weeks', '5 weeks', '3 weeks', '6 weeks', '8 weeks', '8 weeks', '8 weeks', '7 weeks', '6 weeks', '8 weeks', '5 weeks', '4 weeks', '7 weeks', '4 weeks', '5 weeks', '4 weeks', '4 weeks', '2 weeks', '3 weeks', '6 weeks', '14 weeks', '6 weeks', '6 weeks', '8 weeks', '12 weeks', '6 weeks', '5 weeks', '12 weeks', '7 weeks', '12 weeks', '14 weeks', '3 weeks', '10 weeks', '8 weeks', '4 weeks', '8 weeks', '10 weeks', '3 weeks', '4 weeks', '6 weeks', '6 weeks', '4 weeks', '5 weeks', '10 weeks', '4 weeks', '13 weeks', '6 weeks', '5 weeks', '10 weeks', '8 weeks', '2 weeks', '11 weeks', '16 weeks', '4 weeks', '6 weeks', '5 weeks', '4 weeks', '3 weeks', '6 weeks', '5 weeks', '6 weeks', '7 weeks', '4 weeks', '8 weeks
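
The seven `fetch_*` functions all follow one pattern: attempt a lookup and fall back to a placeholder if it fails. As a sketch of how that pattern could be written once, the hypothetical helper `safe_fetch` below takes any zero-argument callable. It is demonstrated with stand-in functions rather than a live browser; `flaky` simply simulates a failed lookup.

```python
def safe_fetch(fetch, default="NA"):
    """Call `fetch()` and return its result, or `default` if it raises."""
    try:
        return fetch()
    except Exception:
        return default


def flaky():
    # Stands in for a Selenium lookup that fails on a differently structured page
    raise ValueError("element not found")


print(safe_fetch(lambda: "2 weeks"))
print(safe_fetch(flaky))
print(safe_fetch(flaky, default=0))
```

With Selenium, the callable would wrap the lookup itself, for example `safe_fetch(lambda: driver.find_element(By.XPATH, xpath).text)`, where `xpath` is one of the expressions used in this notebook.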

The runtime was *long* for this code, but I'll check that all variables were grabbed:

In [13]:
# All seven lists have the same length exactly when the set of their lengths has one element
len({len(values) for values in results.values()}) == 1

True

Moreover, the length of any of these variables should be 1000:

In [14]:
print(len(results["course_length"]))

1000


Now I'll create a DataFrame of the scraped data:

In [15]:
df_uncleaned = pd.DataFrame({"course": courses, 
                             "provider": providers, 
                             "link": links, 
                             "course_length" : results["course_length"],
                             "time_commitment": results["time_commitment"],
                             "start_date": results["course_start_date"],
                             "current_enrollment": results["current_enrollment"],
                             "pace": results["pace"],
                             "price":results["price"],
                             "subject": results["subject"]})

Great, let's display a preview of this DataFrame:

In [16]:
df_uncleaned.head()

Unnamed: 0,course,provider,link,course_length,time_commitment,start_date,current_enrollment,pace,price,subject
0,How to Learn Online,edX,https://www.edx.org/learn/how-to-learn/edx-how...,2 weeks,2–3 hours per week,Starts Jul 27,287998,Self-paced,£54,Education & Teacher Training
1,The Science of\nHappiness,"University of California, Ber…",https://www.edx.org/learn/happiness/university...,11 weeks,4–5 hours per week,Starts Jul 27,583920,Self-paced,£131,Social Sciences
2,Remote Work\nRevolution for\nEveryone,Harvard University,https://www.edx.org/learn/remote-work/harvard-...,3 weeks,2–3 hours per week,Starts Jul 27,114666,Self-paced,£116,Business & Management
3,CS50's Introduction to\nComputer Science,Harvard University,https://www.edx.org/learn/computer-science/har...,12 weeks,6–18 hours per week,Starts Jul 27,5988868,Self-paced,£0,Computer Science
4,Data Visualization and\nBuilding Dashboards\nw...,IBM,https://www.edx.org/learn/data-visualization/i...,4 weeks,2–3 hours per week,Starts Jul 27,48226,Self-paced,£77,Data Analysis & Statistics


And finally save it to a CSV:

In [18]:
# Write to the raw data folder without changing the notebook's working directory
df_uncleaned.to_csv(os.path.join("..", "data", "raw", "raw.csv"))
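
One detail to keep in mind when this file is read back for cleaning: `to_csv` writes the DataFrame index as an unnamed first column by default, and `read_csv` then labels that column `Unnamed: 0`. The sketch below shows the round trip on a toy DataFrame, using in-memory buffers instead of the real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"course": ["How to Learn Online"], "price": ["£54"]})

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'course', 'price']
print(cols_without_index)  # ['course', 'price']
```

Either convention works, as long as the notebook that reads the file expects the same one (for instance by passing `index_col=0` to `read_csv`).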

This marks the end of this notebook. See `NB02-Build-Database.ipynb` for building a database.