In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

## PRELIMINARY STEPS
* get the url
* use the requests.get function to get the html
* use BeautifulSoup to parse the html
* display the prettified soup obj

In [2]:

url ='https://cs.nyu.edu/dynamic/courses/schedule/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
soup.prettify()

'<!DOCTYPE html>\n<!--[if lt IE 7]>\n<html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en" data-theme="light"> <![endif]-->\n<!--[if IE 7]>\n<html class="no-js lt-ie9 lt-ie8" lang="en" data-theme="light"> <![endif]-->\n<!--[if IE 8]>\n<html class="no-js lt-ie9" lang="en" data-theme="light"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html class="no-js" data-theme="light" lang="en">\n <!--<![endif]-->\n <head>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta charset="utf-8"/>\n  <title>\n   NYU Computer Science Department\n  </title>\n  <meta content="" name="description"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <link href="//maxcdn.bootstrapcdn.com/bootstrap/3.3.0/css/bootstrap.min.css" rel="stylesheet" type="text/css"/>\n  <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="stylesheet"/>\n  <link href="/static/css/main.css?20180702b" rel="stylesheet" type="text/css"/>\n  <script src="/static/js/vendor/m

## FINDING THE TABLE
* use the find_all function to get every unordered list (`ul`) and take the one at index 4, which holds the schedule
* print the course list

In [3]:
course_li=soup.find_all('ul')[4]
course_li

<ul class="schedule-listing">
<li class="row" id="csci-ga1170-001">
<div>
<span class="col-xs-12 col-sm-3 italic" style="word-wrap:break-word;">
<a data-placement="left" data-toggle="tooltip" href="https://cs.nyu.edu/courses/spring25/CSCI-GA.1170-001/" title="NYU Home login required to view course page">CSCI-GA.1170-​001</a>
<br/>DS-GA.1170-001
                            <br/>(5115)

                        </span>
<span class="col-xs-12 col-sm-3">
<a class="expand" data-toggle="collapse" href="#csci-ga1170-001-desc">
                            
                            
                            
                              
                              Fundamental Algorithms</a> 
                        </span>
<span class="col-xs-12 col-sm-2">
<a href="http://cs.nyu.edu/cs/faculty/yap/">Chee Yap</a>
</span>
<span class="col-xs-12 col-sm-2">W 4:55-6:55PM</span>
<span class="col-xs-12 col-sm-2">CIWW 109</span>
<span class="col-xs-12 collapse expandable" id="csci-ga1170-001-d
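Selecting the list by position (`[4]`) breaks as soon as the page gains or loses a `<ul>`. A minimal sketch of selecting by the `schedule-listing` class visible in the output above instead. The HTML snippet and the names `sample_html`, `sample_soup`, and `listing` are made up for illustration and only stand in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two unrelated lists around the one we want.
sample_html = """
<ul class="nav"><li>Home</li></ul>
<ul class="schedule-listing">
  <li class="row" id="csci-ga1170-001"><span>CSCI-GA.1170-001</span></li>
</ul>
<ul class="footer"><li>Contact</li></ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
listing = sample_soup.find("ul", class_="schedule-listing")
if listing is None:
    raise ValueError("schedule-listing not found; the page layout may have changed")

row_ids = [li.get("id") for li in listing.find_all("li")]
print(row_ids)
```

Failing loudly when the class is missing is a design choice: a silent positional lookup could return the wrong list without any error.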

## COLLECTING DATA
* get the rows by finding all the list items (`li`) in the course list
* make a dictionary of lists to hold the data, one list per column
* iterate over the rows
    * find all the spans, which is where the data is located
    * take the span at index 0 to find the Number Section
        * strip, split on the newline, and take the first value in the resulting list
        * THIS KEEPS ONLY THE CS COURSE NUMBER, NOT THE CROSS-LISTED DS ONE, TO ENSURE THAT THE VALUES IN THE DICT LINE UP BY INDEX
        * remove the zero-width space char (`\u200b`) and append the Number Section value to the dict list
    * find the course title by taking the span at index 1 and stripping it
        * replace each newline and the whitespace after it with a single space
        * append to the dict list
    * find the prof name by taking the span at index 2
        * append to the dict list
    * find the time of the class by taking the span at index 3
        * append to the dict list
    * find the Course Number by matching the first 12 chars of course_section with a regex
        * append to the dict list

In [4]:
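rows = course_li.find_all('li')
# One list per column; every row appends to each list exactly once so they stay aligned.
data = {"Number Section": [],
        "Name": [],
        "Instructor": [],
        "Time": [],
        "Number": []
       }
for row in rows:
    spans = row.find_all('span')
    # Span 0: section number; keep only the first line (the CS listing).
    course_section = spans[0].text.strip().split('\n')[0]
    data["Number Section"].append(course_section.replace('\u200b', ""))
    # Span 1: title; collapse each newline plus indentation into one space.
    course_title = spans[1].text.strip()
    course_title = re.sub(r'\n\s+', ' ', course_title)
    data["Name"].append(course_title)
    # Span 2: instructor.
    prof = spans[2].text.strip()
    data["Instructor"].append(prof)
    # Span 3: meeting time.
    times = spans[3].text.strip()
    data["Time"].append(times)
    # Course number: the first run of 12 letters, digits, '-' or '.' in the section.
    num = re.search(r'[A-Za-z0-9.-]{12}', course_section).group()
    data["Number"].append(num)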
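The fixed 12-character match works for codes such as `CSCI-GA.1170`, but it would truncate a longer subject prefix: the catalog scraped later in this notebook contains `FRSEM-UA.0597`, which is 13 characters. A sketch of a pattern anchored on structure instead of length. The helper name `course_number` and the assumed code shape (letters, hyphen, letters, dot, four digits) are illustrative assumptions, not something the site documents.

```python
import re

# Assumed shape: SUBJECT-SCHOOL.NNNN, e.g. CSCI-GA.1170 or FRSEM-UA.0597.
COURSE_NUMBER = re.compile(r"[A-Z]+-[A-Z]+\.\d{4}")


def course_number(section: str) -> str:
    """Return the course number from a section string such as 'CSCI-GA.1170-001'."""
    cleaned = section.replace("\u200b", "")  # drop zero-width spaces first
    match = COURSE_NUMBER.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognized section: {section!r}")
    return match.group()


print(course_number("CSCI-GA.1170-\u200b001"))
print(course_number("FRSEM-UA.0597-001"))
```

Raising on an unrecognized section, rather than calling `.group()` on a possible `None`, gives a readable error that names the offending row.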

In [5]:
schedule = pd.DataFrame(data)
schedule.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Number Section  177 non-null    object
 1   Name            177 non-null    object
 2   Instructor      177 non-null    object
 3   Time            177 non-null    object
 4   Number          177 non-null    object
dtypes: object(5)
memory usage: 7.0+ KB


In [6]:
schedule.head()

,Number Section,Name,Instructor,Time,Number
0,CSCI-GA.1170-001,Fundamental Algorithms,Chee Yap,W 4:55-6:55PM,CSCI-GA.1170
1,CSCI-GA.1170-002,Fundamental Algorithms Recitation,Bingwei Zhang,R 5:55-6:45PM,CSCI-GA.1170
2,CSCI-GA.1170-003,Fundamental Algorithms Recitation,Bingwei Zhang,F 4:55-5:45PM,CSCI-GA.1170
3,CSCI-GA.1170-004,Fundamental Algorithms,Yevgeniy Dodis,T 4:55-6:55PM,CSCI-GA.1170
4,CSCI-GA.1170-005,Fundamental Algorithms Recitation,Eli Goldin,R 3:45-4:35PM,CSCI-GA.1170


In [7]:
schedule.tail()

,Number Section,Name,Instructor,Time,Number
172,CSCI-UA.0480-051,Special Topics: Parallel Computing,Mohamed Zahran,TR 2:00-3:15PM,CSCI-UA.0480
173,CSCI-UA.0480-061,Special Topics: Open Source Software Development,Joanna Klukowska,MW 12:30-1:45PM,CSCI-UA.0480
174,CSCI-UA.0480-069,Special Topics: Agile Software Development and...,Amos Bloomberg,MW 11:00-12:15PM,CSCI-UA.0480
175,CSCI-UA.0480-075,Special Topics: Introduction to Deep Learning,Alfredo Canziani,TR 12:30-1:45PM,CSCI-UA.0480
176,CSCI-UA.0490-001,Special Topics in Programming Languages,Benjamin Goldberg,MW 3:30-4:45PM,CSCI-UA.0490


In [8]:
schedule.sample(5)

,Number Section,Name,Instructor,Time,Number
75,CSCI-UA.0002-001,Intro To Computer Programming (No Prior Experi...,Amanda Steigman,MW 8:00-9:15AM,CSCI-UA.0002
56,CSCI-GA.3033-999,Special Topics: Introduction to Computer Visio...,STAFF,R 4:55-5:45PM,CSCI-GA.3033
131,CSCI-UA.0201-002,Computer Systems Organization - Recitation,Anway Agte,F 9:30-10:45AM,CSCI-UA.0201
23,CSCI-GA.2572-002,Deep Learning Lab,TBA,W 3:45-4:35PM,CSCI-GA.2572
82,CSCI-UA.0002-008,Intro To Computer Programming (No Prior Experi...,Emily Zhao,TR 12:30-1:45PM,CSCI-UA.0002


## CATALOG

## PRELIMINARY STEPS
* get the url
* use requests.get to get the html
* make a soup obj by parsing the html

In [23]:
url ='https://cs.nyu.edu/dynamic/courses/catalog/'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')


## Finding the table
* find all of the unordered lists and select the one at index 2 (the third list), which contains the course listing
* print the table

In [10]:
table = soup.find_all('ul')[2]
table

<ul class="courses-listing">
<li class="col-sm-12" id="csci-ga1170">
<p class="bold heading uppercase">
                                                
                                                    CSCI-GA.1170 Fundamental Algorithms
                                                
                                            </p>
<p class="bold">
                                                3 Points.
                                                Graduate-level.
                                                Fall
                                                    , 
                                                Spring,
                                                
                                                Summer.
                                            </p>
<p class="bold">Prerequisites: At least one year of experience with a high-level language such as Pascal, C, C++, or Java; and familiarity with recursive programming methods and with data structures (arra

## Find the rows
* get all the rows which contain the information for the DF
* print the first row to examine the info

In [11]:
rows = table.find_all("li")
rows[0]

<li class="col-sm-12" id="csci-ga1170">
<p class="bold heading uppercase">
                                                
                                                    CSCI-GA.1170 Fundamental Algorithms
                                                
                                            </p>
<p class="bold">
                                                3 Points.
                                                Graduate-level.
                                                Fall
                                                    , 
                                                Spring,
                                                
                                                Summer.
                                            </p>
<p class="bold">Prerequisites: At least one year of experience with a high-level language such as Pascal, C, C++, or Java; and familiarity with recursive programming methods and with data structures (arrays, pointers, stacks, queues,

## COLLECT DATA
* make an empty dict of lists to append to
* iterate over the rows
    * find all of the paragraphs, which contain the info we need
        * find the course number by taking the paragraph at index 0, stripping it, splitting on spaces, and taking element 0 so that only the course number remains
            * append to the dict list
        * find the points by taking the paragraph at index 1, stripping it, splitting on newlines, taking element 0, splitting on spaces, and taking element 0 again so that only the number remains, without the text
            * append to the dict list
        * find the prereqs by taking the paragraph at index 2, stripping it, and slicing off the first 15 characters to remove the label "Prerequisites: " from the value
            * append to the dict list

In [12]:
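# One list per column, filled in parallel.
data = {"Number": [],
        "Prereqs": [],
        "Points": []}
for row in rows:
    paragraphs = row.find_all('p')
    # Paragraph 0: "<number> <title>"; keep the number.
    course_num = paragraphs[0].text.strip().split(" ")[0]
    data["Number"].append(course_num)
    # Paragraph 1: starts with "<points> Points."; keep the leading value.
    points = paragraphs[1].text.strip().split('\n')[0].split(" ")[0]
    data["Points"].append(points)
    # Paragraph 2: drop the 15-character "Prerequisites: " label.
    prereqs = paragraphs[2].text.strip()[15:]
    data["Prereqs"].append(prereqs)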
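Slicing with `[15:]` assumes every third paragraph begins with "Prerequisites: ", and it would silently cut into the text of any paragraph that does not. A hedged alternative that removes the label only when it is present, using `str.removeprefix` (Python 3.9+). The helper name `strip_label` is hypothetical.

```python
def strip_label(text: str, label: str = "Prerequisites:") -> str:
    """Remove a leading label if present, then trim surrounding whitespace."""
    return text.strip().removeprefix(label).strip()


print(strip_label("Prerequisites: Corequisite: linear algebra."))
print(strip_label("Prerequisites:"))
print(strip_label("No label here"))
```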



In [13]:
catalog = pd.DataFrame(data)
catalog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Number   96 non-null     object
 1   Prereqs  96 non-null     object
 2   Points   96 non-null     object
dtypes: object(3)
memory usage: 2.4+ KB


In [14]:
catalog.head()

,Number,Prereqs,Points
0,CSCI-GA.1170,At least one year of experience with a high-le...,3
1,CSCI-GA.1180,,3
2,CSCI-GA.2110,Students taking this class should already have...,3
3,CSCI-GA.2112,Multivariate calculus and linear algebra. Some...,3
4,CSCI-GA.2130,"CSCI-GA 1170, CSCI-GA 2110, and CSCI-GA 2250.",3


In [15]:
catalog.tail()

,Number,Prereqs,Points
91,CSCI-UA.0897,Restricted to declared computer science majors...,1
92,CSCI-UA.0898,Restricted to declared computer science majors...,1
93,CSCI-UA.0997,Permission of the department. Does not satisfy...,1
94,CSCI-UA.0998,Permission of the department. Does not satisfy...,1
95,FRSEM-UA.0597,"Some programming experience in Python, Java, J...",4


In [16]:
catalog.sample(5)

,Number,Prereqs,Points
91,CSCI-UA.0897,Restricted to declared computer science majors...,1
8,CSCI-GA.2262,Students must have a working knowledge of fund...,3
15,CSCI-GA.2421,Corequisite: linear algebra.,3
53,CSCI-GA.3812,For MS in IS students: Successful completion o...,3
58,CSCI-GA.3870,Permission of Director of Graduate Studies.,1-3


## MERGING
* merge the first df with the second using a left merge to keep all of the first df's rows
* reindex the columns of the new df to put them in the desired order (this also drops the "Number Section" column)

In [27]:
df = schedule.merge(catalog, on="Number", how='left')
df = df.reindex(columns=["Number", "Name", 'Instructor', 'Time', "Prereqs", "Points"])
pd.set_option('display.max_rows', None)
df

,Number,Name,Instructor,Time,Prereqs,Points
0,CSCI-GA.1170,Fundamental Algorithms,Chee Yap,W 4:55-6:55PM,At least one year of experience with a high-le...,3
1,CSCI-GA.1170,Fundamental Algorithms Recitation,Bingwei Zhang,R 5:55-6:45PM,At least one year of experience with a high-le...,3
2,CSCI-GA.1170,Fundamental Algorithms Recitation,Bingwei Zhang,F 4:55-5:45PM,At least one year of experience with a high-le...,3
3,CSCI-GA.1170,Fundamental Algorithms,Yevgeniy Dodis,T 4:55-6:55PM,At least one year of experience with a high-le...,3
4,CSCI-GA.1170,Fundamental Algorithms Recitation,Eli Goldin,R 3:45-4:35PM,At least one year of experience with a high-le...,3
5,CSCI-GA.1180,Mathematical Techniques For CS Applications,Parijat Dube,R 7:10-9:10PM,,3
6,CSCI-GA.2110,Programming Languages,Cory Plock,M 4:55-6:55PM,Students taking this class should already have...,3
7,CSCI-GA.2110,Programming Languages Recitation,Anway Agte,R 7:10-8:00PM,Students taking this class should already have...,3
8,CSCI-GA.2110,Programming Languages Recitation,Hrithik Dhoka,F 11:15-12:05PM,Students taking this class should already have...,3
9,CSCI-GA.2110,Programming Languages,CANCELLED,-,Students taking this class should already have...,3
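A left merge can hide two problems: schedule rows with no catalog entry, and duplicate catalog keys that multiply rows. A small sketch of how `validate` and `indicator` expose both, using toy frames. The names `schedule_demo` and `catalog_demo` and the course `CSCI-GA.9999` are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the scraped frames; CSCI-GA.9999 is a made-up course.
schedule_demo = pd.DataFrame({
    "Number": ["CSCI-GA.1170", "CSCI-GA.1170", "CSCI-GA.9999"],
    "Name": ["Fundamental Algorithms",
             "Fundamental Algorithms Recitation",
             "Hypothetical Course"],
})
catalog_demo = pd.DataFrame({
    "Number": ["CSCI-GA.1170", "CSCI-GA.1180"],
    "Points": ["3", "3"],
})

merged = schedule_demo.merge(
    catalog_demo,
    on="Number",             # name the key explicitly
    how="left",              # keep every schedule row
    validate="many_to_one",  # raise MergeError if a catalog Number repeats
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
print(merged)

# Schedule rows that found no catalog entry.
unmatched = merged.loc[merged["_merge"] == "left_only", "Number"].tolist()
print(unmatched)
```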


## CONCLUSION:
* ISSUES WITH THE DATA:
    * Some of the classes are cancelled and therefore shouldn't be in the df because they aren't happening
        * Filter out rows with the value "CANCELLED"
    * Some of the PhD classes list "STAFF" as the instructor, which is ambiguous
        * Replace "STAFF" with "TBD"
    * Some of the PhD classes also don't have a specified time
        * Replace the '-' with a NaN value
    * Some of the classes have a range for points (1-3)
        * Split the classes with a range for points into separate rows: one row for the class with 1 point, one with 2 points, one with 3 points
    * The recitations show the same number of points as the lecture
        * Set the points for all recitation sections to 0 and keep the points value on the lectures; add a column that specifies whether the row is a recitation or a lecture
* MERGING
    * The left merge combines the 2 dfs on their common column (the course number, "Number") while keeping all the rows from the schedule df (first df). Schedule rows with no match in the catalog get NaN for Prereqs and Points.
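The cleanup steps listed above could be sketched as follows. This is one possible approach on toy rows, not the full scraped data. The frame `demo`, the helper `expand_points`, the `Type` column, and the placeholder course name are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy rows modeled on the merged output; the last name is a placeholder.
demo = pd.DataFrame({
    "Number": ["CSCI-GA.2110", "CSCI-GA.2110", "CSCI-GA.2110", "CSCI-GA.3870"],
    "Name": ["Programming Languages", "Programming Languages Recitation",
             "Programming Languages", "Variable-credit course (placeholder)"],
    "Instructor": ["Cory Plock", "Anway Agte", "CANCELLED", "STAFF"],
    "Time": ["M 4:55-6:55PM", "R 7:10-8:00PM", "-", "-"],
    "Points": ["3", "3", "3", "1-3"],
})


def expand_points(value: str) -> list[str]:
    """'1-3' -> ['1', '2', '3']; a single value such as '3' -> ['3']."""
    if "-" in value:
        low, high = value.split("-")
        return [str(p) for p in range(int(low), int(high) + 1)]
    return [value]


# 1. Drop cancelled sections.
clean = demo[demo["Instructor"] != "CANCELLED"].copy()
# 2. Replace the ambiguous "STAFF" placeholder.
clean["Instructor"] = clean["Instructor"].replace("STAFF", "TBD")
# 3. Treat a bare '-' time as missing.
clean["Time"] = clean["Time"].replace("-", np.nan)
# 4. Flag recitations by name and give them 0 points.
is_recitation = clean["Name"].str.contains("Recitation", case=False)
clean["Type"] = np.where(is_recitation, "Recitation", "Lecture")
clean.loc[is_recitation, "Points"] = "0"
# 5. One row per possible point value for ranges such as '1-3'.
clean["Points"] = clean["Points"].map(expand_points)
clean = clean.explode("Points").reset_index(drop=True)

print(clean)
```

Identifying recitations by the word "Recitation" in the name is a heuristic; it relies on the naming pattern seen in the schedule output and would miss sections labeled differently.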