# Cleaning: Udacity's Course Catalog

In this activity, we will use the BeautifulSoup library to extract the following information from each course listed on the page:
1. The course name - e.g. "Data Analyst"
2. The school the course belongs to - e.g. "School of Data Science"

### Step 1: Get text from Udacity's course catalog web page
We can use the `requests` library to do this.

In [32]:
# import
import requests
from bs4 import BeautifulSoup

In [22]:
# fetch web page
r = requests.get('https://www.udacity.com/courses/all')
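
The request above assumes the server answers promptly and with a success status. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, the 10-second default timeout, and the injectable `get` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `get` is a hypothetical hook (defaulting to requests.get) so the
    function can be exercised without touching the network.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html('https://www.udacity.com/courses/all')` would take the place of `r.text` in the next step.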

### Step 2: Use BeautifulSoup to parse the HTML
Pass `"lxml"` as the parser.

In [23]:
soup = BeautifulSoup(r.text, "lxml")
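
To see what the parsed `soup` object gives you, here is a small sketch on an inline HTML string rather than the live page. The snippet is made up for illustration, and it uses Python's built-in `"html.parser"` instead of `"lxml"` only so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# A made-up fragment standing in for the downloaded page
html = "<html><body><h2> Data Engineer </h2><h3>School of Data Science</h3></body></html>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h2")
print(heading.name)                    # the tag name
print(heading.get_text().strip())      # the text inside the tag, whitespace trimmed
print(soup.get_text(" ", strip=True))  # all text in the document, tags dropped
```

The parser builds a tree of tags; the tags are only dropped when you ask for text with `get_text()`.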

### Step 3: Find all course summaries
Use BeautifulSoup's `find_all` method to select elements by tag type and class name.

In [24]:
# Find all course summaries
summaries = soup.find_all('div', class_="_catalog-card-lemur_body__1-oK_")
print('Number of Courses:', len(summaries))

Number of Courses: 253
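
The class name used above ends in what looks like a build-generated suffix (`1-oK_`), so an exact match may stop working if the site is redeployed. As a hedged sketch on a made-up fragment, the following compares the exact `class_` match with a CSS attribute selector that matches only the stable-looking prefix. Whether the prefix really is stable on the live site is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two course cards and one unrelated div
html = """
<div class="_catalog-card-lemur_body__1-oK_"><h2>Data Engineer</h2></div>
<div class="_catalog-card-lemur_body__1-oK_"><h2>Digital Marketing</h2></div>
<div class="unrelated"><h2>Not a course</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact class match, as in the cell above
exact = soup.find_all("div", class_="_catalog-card-lemur_body__1-oK_")

# Prefix match on the class attribute via a CSS selector
by_prefix = soup.select('div[class^="_catalog-card-lemur_body__"]')

print(len(exact), len(by_prefix))
```

Both approaches select the same two cards here; the prefix form simply tolerates a change in the suffix.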


### Step 4: Inspect the first summary to find selectors for the course name and school
Tip: `.prettify()` is a super helpful method BeautifulSoup provides to output HTML in a nicely indented form! Wrap the call in `print()` so that the newlines and indentation are rendered rather than shown as escape characters.
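
A quick sketch of why `print()` matters here, using a made-up one-line fragment and the built-in `"html.parser"`: a notebook displays the `repr` of a cell's last expression, which shows newlines as `\n` escapes, while `print()` renders them.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><h2>Data Engineer</h2></div>", "html.parser")
pretty = soup.div.prettify()

print(repr(pretty))  # what a bare expression would display: one line with \n escapes
print(pretty)        # indented output, one tag per line
```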

In [26]:
# print the first summary in summaries
print(summaries[0].prettify())

<div class="_catalog-card-lemur_body__1-oK_">
 <div>
  <ul class="_catalog-card-lemur_featureFlagContainer__979NY" data-mobileonly="false">
   <li data-type="new">
    <small>
     New
    </small>
   </li>
  </ul>
  <h2>
   Data Engineer
  </h2>
  <h3>
   School of Data Science
  </h3>
  <div class="_catalog-card-lemur_reviews__2nBv7" data-mobileonly="true">
   <div class="nd-rating-stars m--small">
    <div class="active-stars" style="width:91.04364326375712%">
    </div>
   </div>
   <small>
    1054
    <!-- -->
    reviews
   </small>
  </div>
  <p>
   Data Engineering is the foundation for the new world of Big Data. Enroll now to build production-ready data infrastructure, an essential skill for advancing your data career.
  </p>
 </div>
 <ul class="_catalog-card-lemur_stats__3ASYn">
  <li data-level="intermediate">
   intermediate
  </li>
  <li data-duration="">
   5 Months
  </li>
  <li>
   <div class="_catalog-card-lemur_reviews__2nBv7" data-mobileonly="false">
    <div class=

Look for the tags that contain the course title and the school name you want to extract. In this case, the information we are looking for is in the `<h2>` and `<h3>` tags.

In [27]:
# Extract course title
summaries[0].select_one("h2").get_text().strip()

'Data Engineer'

In [28]:
# Extract school
summaries[0].select_one("h3").get_text().strip()

'School of Data Science'

### Step 5: Collect names and schools of ALL course listings

In [30]:
courses = []
for summary in summaries:
    title = summary.select_one("h2").get_text().strip()
    school = summary.select_one("h3").get_text().strip()
    courses.append((title, school))
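
The loop above assumes every summary contains both an `<h2>` and an `<h3>`. If a card lacks one of them, `select_one` returns `None` and the chained `.get_text()` raises an `AttributeError`. A minimal defensive sketch follows; the helper name `text_or_none`, the `card` class, and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = parent.select_one(selector)
    return tag.get_text().strip() if tag is not None else None


# Made-up fragment: the second card has no <h3>
html = """
<div class="card"><h2> Data Engineer </h2><h3>School of Data Science</h3></div>
<div class="card"><h2>Untitled School Card</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

courses = []
for card in soup.find_all("div", class_="card"):
    courses.append((text_or_none(card, "h2"), text_or_none(card, "h3")))

print(courses)
```

Recording `None` for a missing field keeps the loop running and makes incomplete cards easy to filter out afterwards.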

In [31]:
# display results
print(len(courses), "course summaries found. Sample:")
courses[:20]

253 course summaries found. Sample:


[('Data Engineer', 'School of Data Science'),
 ('Digital Marketing', 'School of Business'),
 ('Introduction to Programming', 'School of Programming & Development'),
 ('Business Analytics', 'School of Business'),
 ('Data Scientist', 'School of Data Science'),
 ('Programming for Data Science with Python', 'School of Data Science'),
 ('Data Analyst', 'School of Data Science'),
 ('Product Manager', 'School of Business'),
 ('UX Designer', 'School of Business'),
 ('Front End Web Developer', 'School of Programming & Development'),
 ('Artificial Intelligence for Trading', 'School of Artificial Intelligence'),
 ('AI Programming with Python', 'School of Artificial Intelligence'),
 ('Machine Learning Engineer', 'School of Artificial Intelligence'),
 ('Full Stack Web Developer', 'School of Programming & Development'),
 ('Deep Learning', 'School of Artificial Intelligence'),
 ('Self Driving Car Engineer', 'School of Autonomous Systems'),
 ('C++', 'School of Autonomous Systems'),
 ('Data Structures 