**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.



# COGS 108 - Data Checkpoint

# Names

- Emily Cai
- Jae Kim
- Peter Shamoun
- Viki Shi

# Research Question

-  How well does the UC San Diego computer science curriculum prepare students skill-wise for industry needs in comparison to the other UCs based on the skills reported by developers in industry surveys in the past 5 years?


## Background and Prior Work

## Introduction

Computer science programs are designed to prepare students for successful careers in technology. However, industry technologies and required skills evolve rapidly—often outpacing curriculum updates—resulting in graduates who may lack the hands-on, current technical competencies demanded by employers. With a challenging tech job market and increasingly scarce new-graduate roles, concerns have grown that undergraduate programs, including the UC San Diego computer science program and the broader University of California curriculum, are not adequately preparing students for today's industry needs.

This study evaluates the alignment between the UCSD and overall UC computer science curricula and the skills required in the tech industry. The analysis focuses on:
- **Curriculum Content:** Reviewing course descriptions and curriculum goals.
- **Industry Data:** Analyzing insights from recent Stack Overflow developer surveys and comparing them with current job market requirements.

**Research Objectives and Measurement:**
- **Intended Relationship:** The study seeks to determine the correlation between the curriculum’s technical content and the skills demanded by the tech industry. In particular, it examines whether courses covering emerging technologies like cloud computing, DevOps, containerization with Kubernetes and Docker adequately match the expertise employers seek.
- **Measurement Metrics:** 
  - Quantifying the percentage of courses at UCSD and other UC campuses that include training in in-demand technologies.
  - Comparing the success rate of UCSD graduates in securing tech roles with that of their peers from other UC institutions.
  - Evaluating the representation of practical skills in the curriculum against their frequency in industry job postings.
  - Using survey data to assess graduates’ proficiency in applied skills versus the theoretical emphasis in academic programs.

Prior studies have underscored significant gaps between academic preparation and industry requirements. For example:
- **[Closing the Gap between Software Engineering Education and Industrial Needs (Garousi et al., 2018)](https://arxiv.org/pdf/1812.01954)**  
  This review of 33 studies from 12 countries found that graduates often lack hands-on skills in cloud computing, DevOps, and modern software development practices. It concluded that while universities emphasize theoretical knowledge, employers prioritize applied and soft skills, such as teamwork and communication.
- **[The Gap between Higher Education and the Software Industry – A Case Study on Technology Differences (Dobslaw et al., 2023)](https://arxiv.org/pdf/2303.15597)**  
  This study highlighted a growing disparity between the academic courses that aspiring software engineers take and the practical work they perform on the job. It emphasized the lack of courses on cloud computing and related technologies, suggesting that universities update their curricula more frequently to remain relevant to industry demands.


# Hypothesis


UCSD is ranked third by US News and World Report among UC schools for computer science, so we expect that its curriculum is comparable, if not stronger than other UCs. We predict that UCSD's computer science curriculum prepares students as well as other UCs for industry needs, given its strong ranking. If significant gaps exist, they may be systemic across UC programs rather than specific to UCSD. However, if UCSD shows stronger alignment with industry standards, this may indicate that higher-ranked CS programs provide better preparation from academia to the workforce


# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: Stack Overflow Developer Surveys
  - Link to the dataset:https://survey.stackoverflow.co/
  - Number of observations: 65,000-80,000 each 
  - Number of variables: 116

We intend to use the the survey results from the past 4 years, from 2021-2024. For our research, the important variables on skills learned will include LanguageHaveWorkedWith, DatabaseHaveWorkedWith, PlatformHaveWorkedWith, WebframeHaveWorkedWith, ToolsTechHaveWorkedWith,MiscTechHaveWorkedWith, EmbeddedHaveWorkedWith and ProfessionalTech, which are all categorical, multi-select string data. Additionally, we want to understand respondents' skillsets in relation to their education level and where they learned these relevant skills, so we will also consider the variables EdLevel (single choice string), DevType (single choice string), Country(single choice string). 

Note that the csvs were extremely long and contained thousands of obervations, especially across 4 years. Therefore, we had to split each csv into two parts to fit into our github repo, and our dataset #1 is a merged dataframe on the above specified columns. 

- Dataset #2-10
  - Dataset Name:UC CS Courses
  - Number of observations: 49-100 each
  - Number of variables: 5

We webscraped every CS class from the 9 undergrad UCs, collecting course IDs, titles, and descriptions (strings), whether they were upper or lower div (boolean), and used a python package to extract keywords(list of strings) 

We plan to merge the UC CS Courses together into one dataframe, as we split up the webscraping and collection amongst members

## Stack Overflow Data

In [6]:
import pandas as pd 
pd.set_option('display.max_columns', None)

#2021 survey 
stack_2021_a = pd.read_csv("stack_overflow_survey_2021_1.csv")
stack_2021_b = pd.read_csv("stack_overflow_survey_2021_2.csv")

stack_2021 = pd.concat([stack_2021_a, stack_2021_b], ignore_index=True)

cols = ["Country", "EdLevel", "DevType", "LanguageHaveWorkedWith", 
        "DatabaseHaveWorkedWith", "PlatformHaveWorkedWith", "WebframeHaveWorkedWith", 
        "ToolsTechHaveWorkedWith", "MiscTechHaveWorkedWith"]

stack_2021 = stack_2021[cols]

#2022 survey 
stack_2022_a = pd.read_csv("stack_overflow_survey_2022_1.csv")
stack_2022_b = pd.read_csv("stack_overflow_survey_2022_2.csv")

stack_2022 = pd.concat([stack_2022_a, stack_2022_b], ignore_index=True)

cols = ["Country", "EdLevel", "DevType", "LanguageHaveWorkedWith", 
        "DatabaseHaveWorkedWith", "PlatformHaveWorkedWith", "WebframeHaveWorkedWith", 
        "ToolsTechHaveWorkedWith", "MiscTechHaveWorkedWith"]

stack_2022 = stack_2022[cols]

#2023 survey 
stack_2023_a = pd.read_csv("stack_overflow_survey_2023_1.csv")
stack_2023_b = pd.read_csv("stack_overflow_survey_2023_2.csv")

stack_2023 = pd.concat([stack_2023_a, stack_2023_b], ignore_index=True)

cols = ["Country", "EdLevel", "DevType", "LanguageHaveWorkedWith", 
        "DatabaseHaveWorkedWith", "PlatformHaveWorkedWith", "WebframeHaveWorkedWith", 
        "ToolsTechHaveWorkedWith", "MiscTechHaveWorkedWith"]

stack_2023 = stack_2023[cols]

#2024 survey 
stack_2024_a = pd.read_csv("stack_overflow_survey_2024_1.csv")
stack_2024_b = pd.read_csv("stack_overflow_survey_2024_2.csv")

stack_2024 = pd.concat([stack_2024_a, stack_2024_b], ignore_index=True)

cols = ["Country", "EdLevel", "DevType", "LanguageHaveWorkedWith", 
        "DatabaseHaveWorkedWith", "PlatformHaveWorkedWith", "WebframeHaveWorkedWith", 
        "ToolsTechHaveWorkedWith", "MiscTechHaveWorkedWith"]

stack_2024 = stack_2024[cols]

UC Datasets

In [None]:
#UC Berkeley
import requests
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm
from keybert import KeyBERT

tqdm.pandas()

berk = "https://guide.berkeley.edu/courses/compsci/"
berk_req = requests.get(berk)
soup = BeautifulSoup(berk_req.text, "html.parser")

divs = soup.find_all("div", class_="courseblock")


class_id = []
class_title = []
desc = []
upper = []

for div in divs:
    heading = div.find("p", class_="course-heading")

    if heading:
        course_code = div.find("span", class_="code")
        course_title = div.find("span", class_="title")
        course_desc = div.find(class_="courseblockdesc")

        course_details = div.find_all("p")
        is_undergrad = any("Undergraduate" in p.text for p in course_details)

        if is_undergrad:  
            course_id_text = course_code.text.strip()
            class_id.append(course_id_text)
            class_title.append(course_title.text.strip())
            desc.append(course_desc.text.split('\n')[1])

           
            course_number = int("".join(filter(str.isdigit, course_id_text)))  # Extract numeric part
            is_upper = course_number >= 100
            upper.append(is_upper)  


df = pd.DataFrame({
    "Course ID": class_id,
    "Course Title": class_title,
    "Course Description": desc,
    "Upper Div": upper  
})


def keyword_wrapper(doc):
    kw_model = KeyBERT()
    stop_words = ["cs", "prerequisite", "grade", "requirement", 
    "courses", "instructor", "faculty", "computer", "student", "concurrently", "majors",
    "approach", "aspects", "awarded",
    "concepts", "course", "courses", "credit", "design", "fields",
    "foundation", "fundamental", "fundamentals", "introduction", "issues", "level",
    "lower", "major", "methods", "none", "overview", "perspectives",
    "practice", "practices", "principles", "process", "processes",
    "programs", "related", "required", "requirement", "role",
    "skills", "study", "techniques", "tools", "topics", "understanding",
    "upper", "various", "work"] #dont consider these words  # Extract top 10 keywords


df['keywords'] = df['Course Description'].progress_apply(keyword_wrapper)

100%|██████████| 48/48 [01:05<00:00,  1.37s/it]


In [8]:
#UC Merced

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import time
from tqdm import tqdm

from bs4 import BeautifulSoup
import requests
import re

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


from keybert import KeyBERT

UCM_home_url = "https://catalog.ucmerced.edu/content.php?filter%5B27%5D=CSE&filter%5B29%5D=&filter%5Bkeyword%5D=&filter%5B32%5D=1&filter%5Bcpage%5D=1&cur_cat_oid=23&expand=&navoid=2517&search_database=Filter#acalog_template_course_filter"

response = requests.get(UCM_home_url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    links = [soup.find_all("td", class_="width")[i].find_all('a')[0] for i in range(len(soup.find_all("td", class_="width")))]
    coids = [links[i]['href'][-5:] for i in range(len(links))]
    full_links = [f'https://catalog.ucmerced.edu/preview_course_nopop.php?catoid=23&coid={course}' for course in coids]
else:
    print('Request failed:', response.status_code)

data = []
for url in full_links:
    resposne = requests.get(url)
    if response.status_code == 200:

        driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

        driver.get(url)

        time.sleep(1)

        rendered_html = driver.page_source
        
        driver.quit()

        soup = BeautifulSoup(rendered_html, "html.parser")

        soup = soup.find('td', class_ = 'block_content')

        header = soup.find("h1", id="course_preview_title")

        header_text = header.get_text(strip=True)
        course_code, course_title = [part.strip() for part in header_text.split(":", 1)]

        course_description = ""
        for br in soup.find_all("br"):
            next_text = br.next_sibling
            if next_text and isinstance(next_text, str):
                cleaned = next_text.strip()
                if cleaned and "Unit" not in cleaned:
                    course_description = cleaned
                    break

        result = [course_code, course_title, course_description]
        data.append(result)

    
    else:
        print('Request failed:', response.status_code)

df = pd.DataFrame(columns = ['Course ID', 'Course Title', 'Description'], data=data)

df['Upper Div'] = df['Course ID'].str.extract(r'(\d+)')[0].astype(int).apply(lambda x: x >= 100) #Upper div class is 100-199 class

start = time.time() #time it
tqdm.pandas() #time it

def keyword_wrapper(doc):
    kw_model = KeyBERT() #instantiate model
    
    stop_words = ["cs", "prerequisite", "grade", "requirement", 
    "courses", "instructor", "faculty", "computer", "student", "concurrently", "majors",
    "approach", "aspects", "awarded",
    "concepts", "course", "courses", "credit", "design", "fields",
    "foundation", "fundamental", "fundamentals", "introduction", "issues", "level",
    "lower", "major", "methods", "none", "overview", "perspectives",
    "practice", "practices", "principles", "process", "processes",
    "programs", "related", "required", "requirement", "role",
    "skills", "study", "techniques", "tools", "topics", "understanding",
    "upper", "various", "work"] #dont consider these words

    return [i[0] for i in kw_model.extract_keywords(doc, stop_words=stop_words, top_n=10)] #top 10 keywords

df['keywords'] = df['Description'].progress_apply(keyword_wrapper) #apply functon
df.head()
end = time.time()
print('Time:', end - start) #print time

df= df[['Course ID',	'Course Title',	'Upper Div',	'keywords']]
df.columns = ['Course ID', 'Course Title', 'Upper', 'Skills']

100%|██████████| 43/43 [00:58<00:00,  1.37s/it]

Time: 58.812450885772705





In [9]:
#UCD 
ucd_url = "https://catalog.ucdavis.edu/courses-subject-code/ecs/"
req = requests.get(ucd_url)
soup = BeautifulSoup(req.text)

divs = soup.find_all("div", class_="courseblock")

class_id = []
class_title = []
desc = []


for div in divs:
    code_element = div.find("span", class_="text courseblockdetail detail-code margin--span text--semibold text--big")
    title_element = div.find("span", class_="text courseblockdetail detail-title margin--span text--semibold text--big")
    desc_element = div.find("p", class_= "courseblockextra noindent")

    class_id.append(code_element.text.strip())
    class_title.append(title_element.text.strip().replace('—', ""))
    desc.append(desc_element.text.split("Course Description:")[1])

df = pd.DataFrame({"Course ID": class_id, "Course Title": class_title, "Course Description":desc})

df["Upper Div"] = df["Course ID"].str.extract(r'(\d+)')[0].astype(int).apply(lambda x:x>=100)
df = df[df["Course ID"].str.extract(r'(\d+)')[0].astype(int) < 200]

def keyword_wrapper(doc):
    kw_model = KeyBERT() #instantiate model
    
    stop_words = ["cs", "prerequisite", "grade", "requirement", 
                  "courses", "instructor", "faculty", "computer", "student", "concurrently", "majors"] #dont consider these words

    return [i[0] for i in kw_model.extract_keywords(doc, stop_words=stop_words, top_n=10)] #top 10 keywords

df['keywords'] = df['Course Description'].progress_apply(keyword_wrapper) #apply functon


100%|██████████| 99/99 [03:02<00:00,  1.84s/it]


In [10]:
#UC Irvine 
uci_url = 'https://catalogue.uci.edu/allcourses/compsci/'
req = requests.get(uci_url)
soup = BeautifulSoup(req.text)
divs = soup.find_all("div", class_="courseblock")

class_id = []
class_title = []
desc = []
upper = []

for div in divs:
    # Extract Course ID and Title
    title_element = div.find("p", class_="courseblocktitle")
    if title_element:
        full_title = title_element.get_text(strip=True)
        course_code, course_name = full_title.split('.', 1)
        course_number = int("".join(filter(str.isdigit, course_code)))
        if course_number < 200:
            class_id.append(course_code.strip())
            class_title.append(course_name.split('.')[0].strip())

            desc_element = div.find("div", class_="courseblockdesc")
            desc_text = []
            if desc_element:
                for p in desc_element.find_all("p"):
                    desc_text.append(p.get_text(strip=True))
            desc.append(" ".join(desc_text))
            if course_number >= 100:
                upper.append(True)
            else:
                upper.append(False)

df = pd.DataFrame({
    "Course ID": class_id,
    "Course Title": class_title,
    "Upper": upper,
    "Skills": desc
})

def clean_description(text):
    text = text.split("Prerequisite:")[0]
    text = text.split("Restriction:")[0]
    return text.strip()

df['Skills']= df['Skills'].apply(clean_description)

def keyword_wrapper(doc):
    kw_model = KeyBERT() #instantiate model
    
    stop_words = ["cs", "prerequisite", "grade", "requirement", 
                  "courses", "instructor", "faculty", "computer", "student", "concurrently", "majors", "students"] #dont consider these words

    return [i[0] for i in kw_model.extract_keywords(doc, stop_words=stop_words, top_n=10)] #top 10 keywords

df['Skills'] = df['Skills'].apply(keyword_wrapper) #apply functon


In [None]:
#UC Riverside
UCR_url = "https://www1.cs.ucr.edu/undergraduate/course-descriptions"

response = requests.get(UCR_url)

if response.status_code == 200:
    
    soup = BeautifulSoup(response.text, "html.parser")

    tables = soup.find_all("table", class_="ui yellow definition striped table")

    header = tables[0].find_all('tr')[0].text.split('\n')[1:4]

    data = [tables[0].find_all('tr')[i].text.split('\n')[1:4] for i in range(1, len(tables[0].find_all('tr')))]

    df = pd.DataFrame(data, columns = header)

    
    display(df.head())

else:
    print('response failed:', response.status_code)

    df = df[df['Course'].str[:2]==('CS')].reset_index().drop(columns = ['index']) #remove non-cs courses

df['Upper Div'] = df['Course'].str.extract(r'(\d+)')[0].astype(int).apply(lambda x: x >= 100) #Upper div class is 100-199 class


def remove_prereq(doc):
    doc = re.sub(r'^.*?Prerequisite\(s\):.*?(\.|\n)', '', doc, flags=re.DOTALL).strip() #remove prerequisites
    
    #for first class
    doc = re.sub(r'4 Units, Lecture, 3 hours; laboratory,2 hours; individual study, 1 hour\.', '', doc).strip()
    
    return doc



df['Description'] = df['Description'].apply(remove_prereq)

start = time.time() #time it
tqdm.pandas() #time it

def keyword_wrapper(doc):
    kw_model = KeyBERT() #instantiate model
    
    stop_words = ["cs", "prerequisite", "grade", "requirement", 
    "courses", "instructor", "faculty", "computer", "student", "concurrently", "majors",
    "approach", "aspects", "awarded",
    "concepts", "course", "courses", "credit", "design", "fields",
    "foundation", "fundamental", "fundamentals", "introduction", "issues", "level",
    "lower", "major", "methods", "none", "overview", "perspectives",
    "practice", "practices", "principles", "process", "processes",
    "programs", "related", "required", "requirement", "role",
    "skills", "study", "techniques", "tools", "topics", "understanding",
    "upper", "various", "work"] #dont consider these words

    return [i[0] for i in kw_model.extract_keywords(doc, stop_words=stop_words, top_n=10)] #top 10 keywords

df['keywords'] = df['Description'].progress_apply(keyword_wrapper) #apply functon
df.head()
end = time.time()
print('Time:', end - start) #print time

df = df.drop(columns = ['Description'])
df.columns = ['Course ID', 'Course Title', 'Upper', 'Skills']

Unnamed: 0,Course,Course Title,Description
0,ENGR 001,Professional Development and Mentoring,"1 Unit, Activity, 30 hours per quarter. Provid..."
1,ENGR 101,Professional Development and Mentoring,"1 Unit, Activity, 30 hours per quarter. Prereq..."
2,ENGR 180W,Technical Communications,"4 Units, Lecture, 3 hours; workshop, 3 hours. ..."
3,CS 005,Introduction to Computer Programming,"4 Units, Lecture, 3 hours; laboratory,2 hours;..."
4,CS 006,Effective Use of the World Wide Web,"4 Units, Lecture, 3 hours; laboratory, 3 hours..."


100%|██████████| 60/60 [01:48<00:00,  1.81s/it]

Time: 108.90000176429749





In [12]:
#UCSD
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import time
from tqdm import tqdm

from bs4 import BeautifulSoup
import requests
import re


from keybert import KeyBERT
UCR_url = "https://catalog.ucsd.edu/courses/CSE.html"

response = requests.get(UCR_url)

if response.status_code == 200:
    
    soup = BeautifulSoup(response.text, "html.parser")

    div = soup.find('div', class_ = "col-md-12 blank-slate")

    courses = [i.text.strip().split('.') for i in div.find_all('p', class_ = 'course-name')]
    courses = [course for course in courses if int(re.search(r'\d+', course[0]).group()) < 200]

    for course in courses:
        course[1] = re.sub(r'\s*\(\d+.*$', '', course[1])
    
    descriptions = [i.text for i in div.find_all('p', class_ = 'course-descriptions')]

    filtered_descriptions = [
    description.split('Prerequisites')[0]
    for name, description in zip(courses, descriptions)
    if int(re.search(r'\d+', name[0]).group()) < 200
]

    data = [courses[i] +  [filtered_descriptions[i]] for i in range(len(courses))]

    
    df = pd.DataFrame(columns = ['Course ID', 'Course Title', 'Description'], data=data)
    

else:
    print('response failed:', response.status_code)

df['Upper Div'] = df['Course ID'].str.extract(r'(\d+)')[0].astype(int).apply(lambda x: x >= 100) #Upper div class is 100-199 class


start = time.time() #time it
tqdm.pandas() #time it

def keyword_wrapper(doc):
    kw_model = KeyBERT() #instantiate model
    
    stop_words = ["cs", "prerequisite", "grade", "requirement", 
    "courses", "instructor", "faculty", "computer", "student", "concurrently", "majors",
    "approach", "aspects", "awarded",
    "concepts", "course", "courses", "credit", "design", "fields",
    "foundation", "fundamental", "fundamentals", "introduction", "issues", "level",
    "lower", "major", "methods", "none", "overview", "perspectives",
    "practice", "practices", "principles", "process", "processes",
    "programs", "related", "required", "requirement", "role",
    "skills", "study", "techniques", "tools", "topics", "understanding",
    "upper", "various", "work"] #dont consider these words

    return [i[0] for i in kw_model.extract_keywords(doc, stop_words=stop_words, top_n=10)] #top 10 keywords

df['keywords'] = df['Description'].progress_apply(keyword_wrapper) #apply functon

end = time.time()
print('Time:', end - start) #print time

df = df.drop(columns=['Description'])
df['Skills'] = df['keywords']
df = df.drop(columns=['keywords'])

100%|██████████| 95/95 [02:57<00:00,  1.87s/it]

Time: 177.55299544334412





In [13]:
#UCLA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import time
from tqdm import tqdm

from bs4 import BeautifulSoup
import requests
import re

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


from keybert import KeyBERT

url = "https://registrar.ucla.edu/academics/course-descriptions?search=COM+SCI"


driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

driver.get(url)

time.sleep(1)

rendered_html = driver.page_source

driver.quit()

soup = BeautifulSoup(rendered_html, "html.parser")


upper_div_section = soup.find('div', {'aria-labelledby': 'upper-division-courses-51-1'})
    
lower_div_section = soup.find('div', {'aria-labelledby': 'lower-division-courses-10-1'})

undergrad_courses = []

if lower_div_section:
        course_records = lower_div_section.find_all('div', class_='course-record')
        for record in course_records:
            title_element = record.find('h3')
            description_paragraphs = record.find_all('p')

            if title_element and description_paragraphs:
                full_title = title_element.text.strip()
                parts = full_title.split('.', 1)
                if len(parts) == 2:
                    course_code = "CS " + parts[0].strip()
                    course_title = parts[1].strip()
                else:
                    course_code = "CS Unknown"
                    course_title = full_title
                description = '\n'.join(p.text.strip() for p in description_paragraphs[1:])
                undergrad_courses.append([course_code, course_title, description])

if upper_div_section:
    course_records = upper_div_section.find_all('div', class_='course-record')
    for record in course_records:
        title_element = record.find('h3')
        description_paragraphs = record.find_all('p')

        if title_element and description_paragraphs:
            full_title = title_element.text.strip()
            parts = full_title.split('.', 1)
            if len(parts) == 2:
                course_code = "CS " + parts[0].strip()
                course_title = parts[1].strip()
            else:
                course_code = "CS Unknown"
                course_title = full_title
            description = '\n'.join(p.text.strip() for p in description_paragraphs[1:])
            undergrad_courses.append([course_code, course_title, description])


df = pd.DataFrame(columns = ['Course ID', 'Course Title', 'Description'], data = undergrad_courses)



def extract_description(text):
    try:
        parts = text.split('.', 1) 
        if len(parts) < 2: 
          return text
        description_start = parts[1].strip()

        if "grading" in description_start.lower():
            description_parts = description_start.split("grading")
            core_description = description_parts[0].strip()
        else:
            core_description = description_start

        return core_description

    except:
        return text

df['Description'] = df['Description'].apply(extract_description)

df['Upper Div'] = df['Course ID'].str.extract(r'(\d+)')[0].astype(int).apply(lambda x: x >= 100) #Upper div class is 100-199 class

start = time.time() #time it
tqdm.pandas() #time it

def keyword_wrapper(doc):
    kw_model = KeyBERT() #instantiate model
    
    stop_words = ["cs", "prerequisite", "grade", "requirement", 
    "courses", "instructor", "faculty", "computer", "student", "concurrently", "majors",
    "approach", "aspects", "awarded",
    "concepts", "course", "courses", "credit", "design", "fields",
    "foundation", "fundamental", "fundamentals", "introduction", "issues", "level",
    "lower", "major", "methods", "none", "overview", "perspectives",
    "practice", "practices", "principles", "process", "processes",
    "programs", "related", "required", "requirement", "role",
    "skills", "study", "techniques", "tools", "topics", "understanding",
    "upper", "various", "work", "department", "resources", "requisite", "requisites", "enforced", "lecture", "hours"] #dont consider these words

    return [i[0] for i in kw_model.extract_keywords(doc, stop_words=stop_words, top_n=10)] #top 10 keywords

df['keywords'] = df['Description'].progress_apply(keyword_wrapper) #apply functon
df.head()
end = time.time()
print('Time:', end - start) #print time

df = df.drop(columns=['Description'])

df['Skills'] = df['keywords']
df = df.drop(columns=['keywords'])



100%|██████████| 61/61 [01:37<00:00,  1.60s/it]

Time: 97.55204701423645





In [14]:
#ucsc

import pandas as pd

from bs4 import BeautifulSoup
import requests

from keybert import KeyBERT
from tqdm import tqdm
import re

ucsc_url = 'https://registrar.ucsc.edu/catalog/archive/11-12/programs-courses/course-descriptions/cmpscourses.html'
req = requests.get(ucsc_url)
soup = BeautifulSoup(req.text)

course_blocks = soup.find_all("p")

class_id = []
class_title = []
desc = []
upper = []
def clean_description(text):
    text = text.replace('(2 credits)','')
    text = text.split("Students cannot")[0]
    text = text.split("Prerequisite(s):")[0] 
    text = text.split("(General Education Codes(s):")[0]
    text = text.replace("F,W,S","").replace("F,W","").replace("W,S","").replace("F,S","").replace("*","")
    return text.strip()

for i in range(len(course_blocks)):
    course_text = course_blocks[i].get_text(strip=True)
    if re.match(r"^\d+[A-Z]?\.", course_text):
        split_text = course_text.split(".", 1)
        course_code = split_text[0].strip() 
        course_name = split_text[1].strip() 

        course_number = int("".join(filter(str.isdigit, course_code)))

        if course_number < 200:
            class_id.append(course_code)
            class_title.append(clean_description(course_name).split('.')[0])
            desc_element = course_blocks[i + 1].get_text(strip=True) if i + 1 < len(course_blocks) else ""
            desc.append(clean_description(desc_element).split('.')[2])
            upper.append(course_number >= 100)



df = pd.DataFrame({
    "Course ID": class_id,
    "Course Title": class_title,
    "Upper": upper,
    "Skills": desc
})
def keyword_wrapper(doc):
    kw_model = KeyBERT() #instantiate model
    
    stop_words = ["cs", "prerequisite", "grade", "requirement", 
    "courses", "instructor", "faculty", "computer", "student", "concurrently", "majors",
    "approach", "aspects", "awarded",
    "concepts", "course", "courses", "credit", "design", "fields",
    "foundation", "fundamental", "fundamentals", "introduction", "issues", "level",
    "lower", "major", "methods", "none", "overview", "perspectives",
    "practice", "practices", "principles", "process", "processes",
    "programs", "related", "required", "requirement", "role",
    "skills", "study", "techniques", "tools", "topics", "understanding",
    "upper", "various", "work", "department", "resources", "requisite", "requisites", "enforced", "lecture", "hours"] #dont consider these words

    return [i[0] for i in kw_model.extract_keywords(doc, stop_words=stop_words, top_n=10)] #top 10 keywords

df['Skills'] = df['Skills'].apply(keyword_wrapper) #apply functon


In [15]:
#ucsb

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import time
from tqdm import tqdm

from bs4 import BeautifulSoup
import requests
import re

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


from keybert import KeyBERT

url = "https://cs.ucsb.edu/education/courses/course-descriptions"

response = requests.get(url)

if response.status_code == 200:
    
    soup = BeautifulSoup(response.text, "html.parser")

    table = soup.find('table', class_ = "table table-hover table-striped").find_all('td', class_ = "views-field views-field-title")


    urls = ["https://cs.ucsb.edu"+i.find('a')['href'] for i in table]

else:
    print('response failed:', response.status_code)


metadata_keys = ["Prerequisite", "Enrollment Comments", "Repeat Comments"]

data = []

def clean_paragraph(text):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    cleaned_sentences = []
    for sentence in sentences:
        sentence = sentence.strip()
        if any(sentence.startswith(key) for key in metadata_keys):
            continue
        if sentence:  
            cleaned_sentences.append(sentence)
    return " ".join(cleaned_sentences)

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    under_grad = soup.find_all('div', class_="field--item")
    
    if under_grad[3].text == "Undergraduate":
        course = under_grad[1].text
        title = soup.find('h1', class_="page-header").text.strip()
        desc_tags = soup.find('div', class_="field field--name-field-course-des field--type-text-long field--label-above") \
                        .find('div', class_="field--item") \
                        .find_all("p")
        
        cleaned_descs = []
        for p in desc_tags:
            text = p.get_text(" ", strip=True)
            if any(text.startswith(key) for key in metadata_keys):
                cleaned_text = clean_paragraph(text)
            else:
                cleaned_text = text
            if cleaned_text:
                cleaned_descs.append(cleaned_text)
                
        final_desc = " ".join(cleaned_descs)

        result = [course, title, final_desc]
        data.append(result)


df = pd.DataFrame(columns = ['Course ID', 'Course Title', 'Description'], data=data)

df['Upper Div'] = df['Course ID'].str.extract(r'(\d+)')[0].astype(int).apply(lambda x: x >= 100) #Upper div class is 100-199 class


start = time.time() #time it
tqdm.pandas() #time it

def keyword_wrapper(doc):
    kw_model = KeyBERT() #instantiate model
    
    stop_words = ["cs", "prerequisite", "grade", "requirement", 
    "courses", "instructor", "faculty", "computer", "student", "concurrently", "majors",
    "approach", "aspects", "awarded",
    "concepts", "course", "courses", "credit", "design", "fields",
    "foundation", "fundamental", "fundamentals", "introduction", "issues", "level",
    "lower", "major", "methods", "none", "overview", "perspectives",
    "practice", "practices", "principles", "process", "processes",
    "programs", "related", "required", "requirement", "role",
    "skills", "study", "techniques", "tools", "topics", "understanding",
    "upper", "various", "work"] #dont consider these words

    return [i[0] for i in kw_model.extract_keywords(doc, stop_words=stop_words, top_n=10)] #top 10 keywords

df['keywords'] = df['Description'].progress_apply(keyword_wrapper) #apply functon
df.head()
end = time.time()
print('Time:', end - start) #print time


df = df.drop(columns=['Description'])
df['Skills'] = df['keywords']
df = df.drop(columns=['keywords'])


100%|██████████| 50/50 [01:12<00:00,  1.45s/it]

Time: 72.50465130805969





# Ethics & Privacy

Concluding that UCSD does not adequately prepare students for the workforce could negatively impact the university's reputation and discourage prospective students. This may be a misinterpretation of the university's ability to prepare students, as it is a large research university, and many skills can be learned through on site research programs that teach technologies not outlined in the curriculum. 

PRIVACY: There are no significant privacy concerns in this research as the primary data—university curriculums, reported technologies, and publicly available surveys (e.g., Stack Overflow)—is already accessible to the public. 

BIASES: 

Funding Disparities : Universities with more funding may offer more thorough curriculums, teaching more languages, libraries, and frameworks than others. Thus, they have a higher correlation to tools used in jobs, skewing the analysis toward their favor. The same holds for UCs with higher prestige / reputations. 

However, UCSD is not the most nor least reputable UC. As a result, the more / less prestugous may even out to dappen the effect of funding. 

Bias in Survey Respondents: 
Our main source of data comes from the Stack Overflow 2024 Developer Survey, in which 65,000 respondents from 185 countries answered questions. Our question analyzes the curriculums of UCs, and it is safe to assume most UC CS graduates remain within the country. Therefore, respondents from the other 184 countries may use different technologies, some of which may not be taught in UC curriculums. 

Grouping UCs for Curriculum Averages: Taking averages of curriculums across UC campuses could mask differences in quality between individual programs, making the grouped UCs appear stronger. 

Inclusion of Non-CS Majors in Job Fields: 
The technologies that developers reported arent necessarily used by only computer science majors. Individuals from related majors (Data Science, Mathematics, Computer Engineering),  could have reported their skills, and thus wont align with what CS majors need to know on the job. As a solution, if possible, we should filter data to include only CS major - reported technologies.


# Team Expectations 


* We will discuss through our group discord to discuss anything related to the project.
* If we have issues or disagreements, we will communicate politely and through meaningful group discussions.
* We will divide up the work equally and all work on our part with care
* Jae Kim, Peter Shamoun, Emily Cai, Viki Shi

# Project Timeline Proposal




| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/05  |  3 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Make Discord Server to communicate; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 2/09  |  7 PM |  Look for potential topics | Discuss ideal datasets and ethics; submit project proposal | 
| 2/16  | 7 PM  | Look for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/21  | 2 PM  | Import & Wrangle Data | Review/Edit data; Complete Checkpoint #1- data   |
| 3/09  | 7 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete Checkpoint#2- EDA |
| 3/15  | 7 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |