## Dataset 1: UCSD CAPE Reviews
- Contains the student evaluations for each class at UCSD, including the percentage of students who recommend the instructor and the class, the average number of hours per week spent studying for the class, and the average grade expected and received in the class
- To run the code below, you will need the following:
    - Log in to your UCSD account first, so that your browser's cookies for cape.ucsd.edu contain the login information needed to access the CAPE reviews
    - Once logged in, go to cape.ucsd.edu and download the cookies in JSON format. One way to do this is to use the Chrome extension "Export cookie JSON file for Puppeteer" (https://chromewebstore.google.com/detail/export-cookie-json-file-f/nmckokihipjgplolmcmjakknndddifde?pli=1)

Credit to u/MaxtheBat on Reddit for the code below, which was shared in the following post on scraping UCSD CAPE data:

https://www.reddit.com/r/UCSD/comments/14uh5q5/since_capes_is_being_retired_i_scraped_all_its/

In [1]:
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd
import re

In [10]:
# Load in cookies
with open('cookies/cape.ucsd.edu.cookies.json', 'r') as f:
    cookies_raw = json.load(f)
cookies = {cookie['name']: cookie['value'] for cookie in cookies_raw}
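
Exported cookies go stale once the login session ends, and a stale file may return a login page instead of the CAPE table. The sketch below drops expired entries before building the `cookies` dict. It assumes each exported record may carry an `expires` field in Unix epoch seconds, with a missing or non-positive value marking a session cookie. That field name is an assumption about the export format, and `active_cookies` is a hypothetical helper, not part of the original code.

```python
import time

def active_cookies(cookies_raw, now=None):
    """Return a {name: value} dict, skipping cookies whose expiry has passed.

    Assumes (not confirmed by the source) that each record may carry an
    'expires' field in Unix epoch seconds; missing or non-positive values
    are treated as session cookies and kept.
    """
    now = time.time() if now is None else now
    cookies = {}
    for cookie in cookies_raw:
        expires = cookie.get('expires', -1)
        if 0 < expires < now:
            continue  # expired, skip it
        cookies[cookie['name']] = cookie['value']
    return cookies

# Made-up records for illustration only
sample = [
    {'name': 'session', 'value': 'abc', 'expires': -1},
    {'name': 'old', 'value': 'xyz', 'expires': 1},
]
print(active_cookies(sample, now=1_000))
```

If the result is empty or is missing the login cookie, re-exporting the cookies after logging in again is the likely fix.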

In [11]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36',
    'Accept-Encoding': '*',
    'Connection': 'keep-alive'
}

In [12]:
url = 'https://cape.ucsd.edu/responses/Results.aspx?Name=%2C&CourseNumber='

In [13]:
# Initiate get request to CAPEs (with all entries)
response = requests.get(url, cookies=cookies, headers=headers)

In [15]:
# Parse request and scrape table
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')
table_body = table.find('tbody')
rows = table_body.find_all('tr')

In [16]:
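# Parse each row for data and put in list
data = []
for row in rows:
    cols = row.find_all('td')
    # Build the evaluation URL; a separate name keeps the results-page `url` above from being overwritten
    eval_url = 'https://cape.ucsd.edu/' + row.find('a')['href'].lstrip('./')
    # Strip whitespace and remove commas so each cell is safe to write as a CSV field
    cols = [ele.text.strip().replace(',', '') for ele in cols]
    cols.append(eval_url)
    # Drop empty cells
    data.append([ele for ele in cols if ele])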

In [37]:
# Open file
with open('data/capes_data.csv', 'w', encoding='utf-8') as file:
    # Write file header
    file.write('Instructor,Course,Quarter,Total Enrolled in Course,Total CAPEs Given,Percentage Recommended Class,Percentage Recommended Professor,Study Hours per Week,Average Grade Expected,Average Grade Received,Evaluation URL\n')
    
    # Write course data
    for course in data:
        file.write(','.join(course))
        file.write('\n')
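
The loop above relies on commas having been stripped from every cell during parsing. A hedged alternative is the standard-library `csv` module, which quotes any field containing a comma so the text can stay intact. The rows below are made-up placeholders, not real CAPE data, and the sketch writes to an in-memory buffer rather than `data/capes_data.csv`.

```python
import csv
import io

header = ['Instructor', 'Course', 'Quarter']
# Placeholder rows for illustration only (not real CAPE data)
rows = [
    ['Doe, Jane', 'ABC 100 - Example Course (A)', 'FA23'],
    ['Roe, Richard', 'ABC 101 - Another Course (B)', 'WI24'],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(header)
writer.writerows(rows)  # fields containing commas are quoted automatically

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original cells, commas included
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1][0])
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''`.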

## Dataset 2: UCSD Course Catalog

In [47]:
catalog_url = 'https://catalog.ucsd.edu/front/courses.html'

In [3]:
# Fetch the content from the URL
response = requests.get(catalog_url)
content = response.content

In [4]:
# Parse the content with BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

In [5]:
# Find all links in the page
links = soup.find_all('a', href=True)

In [6]:
links

[<a class="sr-only skip-to-main" href="#main-content">Skip to main content</a>,
 <a class="title-header title-header-large" href="/">
             General Catalog
     </a>,
 <a class="title-header title-header-short" href="/">
             General Catalog
     </a>,
 <a class="title-logo" href="http://www.ucsd.edu">UC San Diego</a>,
 <a href="https://catalog.ucsd.edu/front/courses.html">Courses/Curricula/Faculty</a>,
 <a href="../about/index.html">About <span class="caret"></span> </a>,
 <a href="../about/about-uc-san-diego/index.html">About UC San Diego</a>,
 <a href="https://catalog.ucsd.edu/academic-integrity.html">Academic Integrity</a>,
 <a href="../about/policies/index.html">Regulations &amp; Policies</a>,
 <a href="../about/calendars/index.html">Calendars</a>,
 <a href="../about/additional-resources/index.html">Additional Resources</a>,
 <a href="../undergraduate/index.html">Undergraduate <span class="caret"></span> </a>,
 <a href="../undergraduate/overview/index.html">Undergra

In [7]:
# Get all links for the department pages that contain course information
base_url = "../courses/"
department_links = [link['href'] for link in links if link['href'].startswith(base_url)]
department_links = ["https://catalog.ucsd.edu" + link.strip('..') for link in department_links]
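
Note that `str.strip('..')` removes any leading or trailing `.` characters rather than the literal `..` prefix; it works here because the links end in `.html`. A hedged alternative is `urllib.parse.urljoin`, which resolves each relative link against the page it was found on. The hrefs below follow the `../section/page.html` shape seen in the link listing above; the two course hrefs are inferred from `base_url`, not copied from the page.

```python
from urllib.parse import urljoin

catalog_url = 'https://catalog.ucsd.edu/front/courses.html'

# Relative hrefs in the shape seen in the link listing (course hrefs are inferred)
hrefs = ['../courses/AIP.html', '../about/index.html', '../courses/CHEM.html']

department_links = [
    urljoin(catalog_url, href)      # '..' is resolved relative to /front/
    for href in hrefs
    if href.startswith('../courses/')
]
print(department_links)
# ['https://catalog.ucsd.edu/courses/AIP.html', 'https://catalog.ucsd.edu/courses/CHEM.html']
```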

In [9]:
# Units for courses whose catalog entry does not list them
missing_units = {'COMM 101A': '4', 'HIEU 124': '4',
                 'HILA 119': '4', 'JAPN 180': '4',
                 'JAPN 190': '4', 'LIGN 9GS': '4',
                 'USP 131': '4', 'USP 141A': '6',
                 'USP 141B': '6'}

In [10]:
def split_courses(course_string):
    """Expand a combined course listing (e.g. 'AAS/ANSC 185') into a list of individual course codes."""
    # format similar to BENG/BIMM/CSE 181A
    if re.match(r"^([A-Z]+/)+[A-Z]+ \d+\w*$", course_string):
        depts = course_string.split()[0].split('/')
        expanded_courses = [dept + " " + course_string.split()[-1] for dept in depts]
    
    # format similar to BGGN 249A-B-C OR EDS 129 A-B-C OR ECE 145AL-BL-CL
    elif re.match(r"^[A-Z]+ \d+[A-Z]* *(?:[A-Z]+-?)+$", course_string):
        # Extract the department and course number
        parts = course_string.split(' ')
        department = parts[0]
        number_part = parts[1] if len(parts) == 2 else parts[1] + parts[2]

        # Extract the base course number and the letter sequences
        base_number = ''.join(filter(str.isdigit, number_part))
        letter_sequences = re.findall(r'[A-Z]+', number_part)

        # Construct the individual course codes
        expanded_courses = [f"{department} {base_number}{seq}" for seq in letter_sequences]
    
    # format similar to HIUS 167/267/ETHN 180
    elif re.match(r"^[A-Z]+ \d+/\d+/\w+ \d+$", course_string):
        split_by_par = course_string.split("/")
        first_dept = split_by_par[0].split()[0] + " "
        expanded_courses = [split_by_par[0], first_dept + split_by_par[1], split_by_par[2]]
    
    # format similar to CHIN 160/260
    elif re.match(r"^[A-Z]+ \d+/\d+\w*$", course_string):
        split_by_par = course_string.split("/")
        dept = split_by_par[0].split()[0] + " "
        expanded_courses = [split_by_par[0], dept + split_by_par[1]]
        
    # format similar to EDS 31/CHEM 96
    elif re.match(r"^[A-Z]+ \d+\w*(/\w+ \w* *\d+\w*)*$", course_string):
        expanded_courses = course_string.split("/")
        
    # format similar to GSS 21-22-23-25-26-27
    elif re.match(r"^[A-Z]+ \d+(-\d+)+$", course_string):
        course_split = course_string.split()
        course_nums = course_split[1].split('-')
        expanded_courses = [course_split[0] + " " + i for i in course_nums]
        
    # format similar to HMNR 101/ANSC 140 or COMM 114A
    elif re.match(r"^[A-Z]+ \d+/\w+ \d+ or \w+ \d+\w*$", course_string):
        course_split = course_string.split("/")
        course_split_or = course_split[1].split(" or ")
        expanded_courses = [course_split[0]] + course_split_or
    
    # format similar to LIGM 5A, 5B, 5C, 5D
    elif re.match(r"^[A-Z]+ \d+\w*(, \d+\w*)+$", course_string):
        courses_split = course_string.split(", ")
        course_dept = courses_split[0].split()[0]
        expanded_courses = [courses_split[0]] + [course_dept + " " + code for code in courses_split[1:]]
    
    # format similar to ANTH 268, COGR 225A, HIGR 238, PHIL 209A, SOCG 255A
    elif re.match(r"^[A-Z]+ \d+(, \w+ \d+\w*)+$", course_string):
        expanded_courses = course_string.split(", ")
        
    # format similar to POLI 5 or 5D
    elif re.match(r"^[A-Z]+ \d+ or \d+\w*$", course_string):
        courses_split = course_string.split(" or ")
        course_dept = courses_split[0].split()[0]
        expanded_courses = [courses_split[0]] + [course_dept + " " + code for code in courses_split[1:]]
        
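    # no known format matched: fail with a clear message instead of an unbound variable
    else:
        raise ValueError(f"Unrecognized course format: {course_string!r}")

    return expanded_courses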

# Example usage
print(split_courses("AAS/ANSC 185"))
print(split_courses("BGGN 249A-B-C"))
print(split_courses("CHIN 160/260"))
print(split_courses("EDS 31/CHEM 96"))
print(split_courses("GSS 21-22-23-25-26-27"))
print(split_courses("HMNR 101/ANSC 140 or COMM 114A"))
print(split_courses("LIGM 5A, 5B, 5C, 5D"))
print(split_courses("ANTH 268, COGR 225A, HIGR 238, PHIL 209A, SOCG 255A"))
print(split_courses("POLI 5 or 5D"))

['AAS 185', 'ANSC 185']
['BGGN 249A', 'BGGN 249B', 'BGGN 249C']
['CHIN 160', 'CHIN 260']
['EDS 31', 'CHEM 96']
['GSS 21', 'GSS 22', 'GSS 23', 'GSS 25', 'GSS 26', 'GSS 27']
['HMNR 101', 'ANSC 140', 'COMM 114A']
['LIGM 5A', 'LIGM 5B', 'LIGM 5C', 'LIGM 5D']
['ANTH 268', 'COGR 225A', 'HIGR 238', 'PHIL 209A', 'SOCG 255A']
['POLI 5', 'POLI 5D']
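
Each branch of `split_courses` first matches a pattern and then re-splits the string by hand. As a hedged sketch of an alternative style, named groups can capture the pieces during the match itself. Only the cross-listed format (`BENG/BIMM/CSE 181A`) is shown, and `split_cross_listed` is a hypothetical name, not part of the notebook.

```python
import re

# One or more departments separated by '/', then a shared course number
CROSS_LISTED = re.compile(r"(?P<depts>[A-Z]+(?:/[A-Z]+)+) (?P<number>\d+\w*)")

def split_cross_listed(course_string):
    """Expand 'BENG/BIMM/CSE 181A' into one code per department.

    Strings in any other format are returned unchanged in a one-item list.
    """
    m = CROSS_LISTED.fullmatch(course_string)
    if m is None:
        return [course_string]
    return [f"{dept} {m['number']}" for dept in m['depts'].split('/')]

print(split_cross_listed("BENG/BIMM/CSE 181A"))
print(split_cross_listed("AAS/ANSC 185"))
print(split_cross_listed("CSE 100"))
```

Under this design, each format would get its own compiled pattern and handler, which keeps the matching rule and the extraction logic in one place.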


In [11]:
course_info = []

for dept in department_links:
    print(dept)
    # get content from department page
    response = requests.get(dept)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')

    # Get the course names and course descriptions
    course_name_elements = soup.find_all('p', class_='course-name')

    # Extract the course code, department, title, units, description and prerequisites
    for tag in course_name_elements:
        course_description_element = tag.find_next_siblings('p', limit=1)[0]

        full_course_name = tag.get_text().strip()
        full_course_description = course_description_element.get_text().strip()

        # Special Case: For Linguistics language classes, the course code is formatted differently
        if full_course_name.startswith("Linguistics"):
            course_code = full_course_name.split('(')[1].split('.')[0].replace(')', '')
        else:
            if '.' in full_course_name:
                course_code = full_course_name.split('.')[0]
            else:
                course_code = ' '.join(full_course_name.split()[:2])

        course_dept = ' '.join(course_code.split()[:-1])

        # Special Case: Some courses don't have units listed
        if "(" in full_course_name:
            if "." in full_course_name:
                course_title = '('.join(full_course_name.split('(')[:-1]).split('.')[1].strip() # full_course_name.split('(')[-2].split('.')[1].strip()
            else:
                course_title = ' '.join(full_course_name.split(' ')[2:]).strip()
            course_units = re.findall(r'\((.*?)\)', full_course_name)[-1]
        else:
            course_title = ' '.join(full_course_name.split(' ')[2:]).strip()
            course_units = missing_units[course_code]

        course_description = full_course_description.split('Prerequisites:')[0].strip()

        # Check for the same 'Prerequisites:' marker that is used for splitting
        if "Prerequisites:" in full_course_description:
            course_prerequisites = full_course_description.split('Prerequisites:')[1].strip()
        else:
            course_prerequisites = "none"

        
        # if courses contain '/', 'or', '-', or ',', they need to be split to have one course per row
        if any(c in course_code for c in ['/', '-', ',', 'or']):
            expanded_courses = split_courses(course_code)
            for new_course_code in expanded_courses:
                course_dept = new_course_code.split()[0]
                course_info.append([new_course_code, course_dept, course_title, course_units, course_description, course_prerequisites])
        else:
            course_info.append([course_code, course_dept, course_title, course_units, course_description, course_prerequisites])

https://catalog.ucsd.edu/courses/AIP.html
https://catalog.ucsd.edu/courses/AASM.html
https://catalog.ucsd.edu/courses/AWP.html
https://catalog.ucsd.edu/courses/ANTH.html
https://catalog.ucsd.edu/courses/AAPI.html
https://catalog.ucsd.edu/courses/AUDL.html
https://catalog.ucsd.edu/courses/BIOI.html
https://catalog.ucsd.edu/courses/BIOL.html
https://catalog.ucsd.edu/courses/BIOM.html
https://catalog.ucsd.edu/courses/CHEM.html
https://catalog.ucsd.edu/courses/CLS.html
https://catalog.ucsd.edu/courses/CHIN.html
https://catalog.ucsd.edu/courses/CLAS.html
https://catalog.ucsd.edu/courses/CCS.html
https://catalog.ucsd.edu/courses/CSP.html
https://catalog.ucsd.edu/courses/CLIN.html
https://catalog.ucsd.edu/courses/CLRE.html
https://catalog.ucsd.edu/courses/COGS.html
https://catalog.ucsd.edu/courses/COMM.html
https://catalog.ucsd.edu/courses/css.html
https://catalog.ucsd.edu/courses/CGS.html
https://catalog.ucsd.edu/courses/CAT.html
https://catalog.ucsd.edu/courses/DSC.html
https://catalog.ucsd
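
The loop above pulls the code, title, and units out of each heading with chained `split` calls. The sketch below does the same with one regular expression, assuming the common heading shape `CODE. Title (units)`, for example `AIP 97. Academic Internship (2, 4)`. That example string is reconstructed from the scraped values rather than copied from the catalog page. `parse_course_name` and `split_prerequisites` are hypothetical helpers, and the Linguistics and missing-units special cases are not covered.

```python
import re

# Assumed heading shape: "<code>. <title> (<units>)"
COURSE_NAME = re.compile(
    r"(?P<code>[^.]+)\.\s*(?P<title>.*?)\s*\((?P<units>[^()]*)\)\s*$"
)

def parse_course_name(full_course_name):
    """Return (code, title, units), or None if the heading has another shape."""
    m = COURSE_NAME.match(full_course_name.strip())
    if m is None:
        return None
    return m['code'].strip(), m['title'], m['units']

def split_prerequisites(full_course_description):
    """Split a description into (description, prerequisites)."""
    description, sep, prereqs = full_course_description.partition('Prerequisites:')
    # sep is empty when the marker is absent, so no IndexError can occur
    return description.strip(), (prereqs.strip() if sep else 'none')

print(parse_course_name('AIP 97. Academic Internship (2, 4)'))
print(split_prerequisites('Group study. Prerequisites: upper-division standing.'))
print(split_prerequisites('Group study.'))
```

Returning `None` for unexpected headings makes the special cases explicit at the call site instead of hiding them in index arithmetic.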

In [31]:
def grad_or_undergrad(code):
    """
    Designates courses as graduate if course number >= 200, lower division if course number < 100, and upper division otherwise
    """
    course_num = int(re.findall(r'\d+', code.split()[-1])[0])
    if course_num < 200:
        if course_num < 100:
            return 'Lower Division'
        return 'Upper Division'
    return 'Graduate'

In [39]:
def capes_url(code):
    split_by_plus = '+'.join(code.split())
    return "https://cape.ucsd.edu/responses/Results.aspx?Name=&CourseNumber=" + split_by_plus

In [42]:
course_info_df['Code'].apply(capes_url)[1000]

'https://cape.ucsd.edu/responses/Results.aspx?Name=&CourseNumber=COGS+188'
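
The same query string can also be built with `urllib.parse.urlencode`, which escapes spaces as `+` and handles any other reserved characters. This is a sketch of an alternative, not the notebook's method; `capes_url_encoded` is a hypothetical name. One difference: `capes_url` collapses runs of whitespace via `split()`, whereas `urlencode` encodes every space it is given.

```python
from urllib.parse import urlencode

def capes_url_encoded(code):
    """Build the CAPE results URL for a course code such as 'COGS 188'."""
    query = urlencode({'Name': '', 'CourseNumber': code})
    return 'https://cape.ucsd.edu/responses/Results.aspx?' + query

print(capes_url_encoded('COGS 188'))
# https://cape.ucsd.edu/responses/Results.aspx?Name=&CourseNumber=COGS+188
```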

In [64]:
# Store as a pandas DataFrame, dropping duplicate course codes and the 'Electives' row
course_info_df = pd.DataFrame(course_info, columns=['Code', 'Department', 'Title', 'Units', 'Description', 'Prerequisites']).drop_duplicates(subset=['Code']).reset_index(drop=True)
course_info_df = course_info_df[course_info_df['Code'] != 'Electives']

# Designate each course as lower division, upper division, or graduate, and add its CAPEs URL
course_info_df = course_info_df.assign(Level=course_info_df['Code'].apply(grad_or_undergrad))
course_info_df = course_info_df.assign(URL=course_info_df['Code'].apply(capes_url)).set_index('Code')

# Fill in course descriptions that came back null from the scrape
course_info_df.at['CHEM 299', 'Description'] = 'none'
course_info_df.at['BENG 296', 'Description'] = 'Independent work by graduate students engaged in research and writing theses. (S/U grades only.)'
course_info_df.at['SE 296', 'Description'] = 'none'
course_info_df.at['JAPN 180', 'Description'] = 'none'
course_info_df.at['JAPN 190', 'Description'] = 'none'
course_info_df.at['MATS 296', 'Description'] = 'none'
course_info_df.at['NEU 298', 'Description'] = 'none'
course_info_df.at['PHYS 258', 'Description'] = 'Discussions of current research in astrophysics and space physics. (S/U grades only.)'
course_info_df.at['POLI 132', 'Description'] = 'Political development has dominated the study of comparative politics among US academicians since the revival of the Cold War in 1947. This course examines critically this paradigm and its Western philosophical roots in the context of the experience of modern China.'

course_info_df = course_info_df.reset_index()
course_info_df

| | Code | Department | Title | Units | Description | Prerequisites | Level | URL |
|---|---|---|---|---|---|---|---|---|
| 0 | AIP 97 | AIP | Academic Internship | 2, 4 | Individual placements for field learning. Must... | lower-division standing, completion of thirty ... | Lower Division | https://cape.ucsd.edu/responses/Results.aspx?N... |
| 1 | AIP 197 | AIP | Academic Internship Program | 2, 4, 6, 8, 10, 12 | Individual internship placements integrated wi... | upper-division standing; department approval. | Upper Division | https://cape.ucsd.edu/responses/Results.aspx?N... |
| 2 | AIP 197DC | AIP | UCDC: Washington, DC Internship | 6, 8, 10 | This internship is attached to the University ... | upper-division standing; department approval. | Upper Division | https://cape.ucsd.edu/responses/Results.aspx?N... |
| 3 | AIP 197P | AIP | Public Service Internship | 4, 8, 12 | Individual placements for field learning perfo... | ninety units completed; 2.5 minimum cumulative... | Upper Division | https://cape.ucsd.edu/responses/Results.aspx?N... |
| 4 | AIP 197T | AIP | Academic Internship Program—Special Programs | 2 | Individual placements for field learning assoc... | ninety units minimum completed; 2.5 minimum cu... | Upper Division | https://cape.ucsd.edu/responses/Results.aspx?N... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7164 | WCWP 100 | WCWP | Academic Writing | 4 | An upper-division workshop course in argumenta... | junior/senior standing and must be a Warren Co... | Upper Division | https://cape.ucsd.edu/responses/Results.aspx?N... |
| 7165 | WCWP 160 | WCWP | Technical Writing for Scientists and Engineers | 4 | An upper-division workshop-style writing cours... | junior/senior standing. | Upper Division | https://cape.ucsd.edu/responses/Results.aspx?N... |
| 7166 | WARR 189 | WARR | Academic Mentoring and the Writing Process | 2 | Students will gain a fundamental understanding... | permission of instructor is required to enroll. | Upper Division | https://cape.ucsd.edu/responses/Results.aspx?N... |
| 7167 | WCWP 198 | WCWP | Group Study | 2 | A directed group study involving research and ... | none | Upper Division | https://cape.ucsd.edu/responses/Results.aspx?N... |


In [65]:
course_info_df.to_csv('data/course_catalog.csv', index=False)

## Dataset 3: UCSD Schedule of Classes
- Scraped for Winter 2024 because the Spring 2024 schedule was not yet available at the time of scraping

In [145]:
from urllib.parse import urljoin

def get_subject_links(main_page_url):
    """Return the absolute URLs of all subject pages linked from a letter index page."""
    response = requests.get(main_page_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the "Subjects" section.
    # The selector depends on the page's HTML structure and may need adjusting if the page changes.
    subject_section = soup.find('div', {'id': 'subject_Panel'})

    subject_links = []
    if subject_section:
        links = subject_section.find_all('a')
        for link in links:
            href = link.get('href')
            if href and href.startswith('courseList.aspx?name=') and not href.endswith('dept=true'):
                # Resolve the relative href against the page URL; plain concatenation
                # would append it after the '?u_letter=' query string
                full_link = urljoin(main_page_url, href)
                subject_links.append(full_link)

    return subject_links

In [146]:
def get_courses_from_subject(subject_url):
    """Return the unique course codes listed on a subject page."""
    response = requests.get(subject_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the list under the "Select your Course:" header
    # (`string=` replaces the deprecated `text=` argument)
    header = soup.find('h3', string='Select your Course:')
    course_section = header.find_next_sibling('ul') if header else None

    # Find all links within this section
    course_links = course_section.find_all('a') if course_section else []

    course_list = []
    for link in course_links:
        if 'coursemain' in link.get('href', ''):
            # Keep the first two tokens of the link text as the course code
            course_list.append(' '.join(link.text.split()[:2]))

    return list(set(course_list))

In [147]:
starting_letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W']
all_courses = []

# for each possible starting letter of the subjects
for letter in starting_letters:
    main_page_url = "https://courses.ucsd.edu/?u_letter=" + letter
    
    print(letter)
    
    # get all subjects starting with the letter
    subject_links = get_subject_links(main_page_url)
    
    # for each subject
    for subject_link in subject_links:
        subject_name = subject_link.split('=')[-1]
        
        # get all courses in that subject
        courses = get_courses_from_subject('https://courses.ucsd.edu/courseList.aspx?name=' + subject_name)
        all_courses += courses

A
B
C
D
E
F
G
H
I
J
L
M
N
O
P
R
S
T
U
V
W
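
The loop above sends one request per subject back to back. A hedged sketch of spacing the requests out is shown below. `fetch_all`, its `fetch` parameter, and the half-second default are assumptions for illustration, and the subject names in the URLs are placeholders. The fetch function is passed in so the sketch runs without network access; in the notebook, `get_courses_from_subject` could be passed instead.

```python
import time

def fetch_all(urls, fetch, delay_seconds=0.5):
    """Call fetch(url) for each URL in order, pausing between calls."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        results[url] = fetch(url)
    return results

# Stand-in for a real request function, used only for illustration
def fake_fetch(url):
    return ['course list for ' + url.split('=')[-1]]

pages = fetch_all(
    ['https://courses.ucsd.edu/courseList.aspx?name=CSE',
     'https://courses.ucsd.edu/courseList.aspx?name=MATH'],
    fake_fetch,
    delay_seconds=0,  # no delay needed for the stand-in
)
print(pages)
```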


In [148]:
len(all_courses)

2381

In [526]:
# Exporting Winter 2024 courses to a csv file
pd.DataFrame(all_courses, columns=['Course']).to_csv('data/winter2024.csv')

## Dataset 4: WebReg
- Scraped for Spring 2024
- Only contains courses from departments found in the course catalog

In [3]:
num_pages = 20
web_reg_courses = set()

for i in range(1, num_pages + 1):
    print("Working on page", i)
    file_path = "data/web_reg/sp24_web_reg_" + str(i) + ".html"
    
    # open the html file and read it in
    with open(file_path, 'r', encoding='utf-8') as file:
        web_reg_content = file.read()
        
    # parse the html file
    soup = BeautifulSoup(web_reg_content, 'html.parser')
    
    # get all the html tags that contain the course codes
    course_tags = soup.find_all(attrs={'id': 'search-group-header-id'}) # get all table tags
    
    # iterate through the tags and grab all the course codes and add to web_reg_courses
    for tag in course_tags:
        course = ' '.join(re.split(r'\s+', tag.find('tr').text.strip())[:2])
        web_reg_courses.add(course)

Working on page 1
Working on page 2
Working on page 3
Working on page 4
Working on page 5
Working on page 6
Working on page 7
Working on page 8
Working on page 9
Working on page 10
Working on page 11
Working on page 12
Working on page 13
Working on page 14
Working on page 15
Working on page 16
Working on page 17
Working on page 18
Working on page 19
Working on page 20
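
The extraction step keeps the first two whitespace-separated tokens of each header row. The sketch below isolates that step on a made-up header string, since the real WebReg markup is not reproduced here. `course_code_from_header` is a hypothetical name.

```python
import re

def course_code_from_header(header_text):
    """Return the first two whitespace-separated tokens, e.g. 'CSE 100'."""
    return ' '.join(re.split(r'\s+', header_text.strip())[:2])

# Made-up header text with irregular spacing, for illustration only
sample = '\n   CSE\t 100   Example Course Title  ( 4 Units )\n'
print(course_code_from_header(sample))

# str.split() with no arguments also splits on runs of whitespace
assert course_code_from_header(sample) == ' '.join(sample.split()[:2])
```

Because `str.split()` with no arguments gives the same tokens, the `re.split` call could be dropped without changing the result.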


In [9]:
web_reg_courses = sorted(web_reg_courses)

In [13]:
pd.DataFrame(web_reg_courses).to_csv('data/sp24_web_reg.csv', index=False)