# BeautifulSoup Extraction Tutorial
This notebook walks you step by step through extracting data from an HTML page with BeautifulSoup.

# HTML Elements Reference

| HTML Element / Attribute | Description |
|--------------|-------------|
| `<a>`        | Defines a hyperlink, used for navigation. |
| `<h1> - <h6>` | Defines headings, where `<h1>` is the most important and `<h6>` the least. |
| `<p>`        | Defines a paragraph. |
| `href`       | Attribute used in `<a>` to define the link's destination. |
| `<div>`      | Defines a division or container in HTML. |
| `<span>`     | Defines an inline container, used for styling or grouping elements. |
| `<ul>`       | Defines an unordered list (bullet points). |
| `<ol>`       | Defines an ordered list (numbered list). |
| `<li>`       | Defines a list item inside `<ul>` or `<ol>`. |
| `<table>`    | Defines a table. |
| `<tr>`       | Defines a row in a table. |
| `<td>`       | Defines a cell inside a table row. |
| `<th>`       | Defines a header cell in a table. |
| `<img>`      | Embeds an image. |
| `<form>`     | Defines a form to collect user input. |
| `<input>`    | Defines an input field inside a form. |
| `<button>`   | Defines a clickable button. |
| `<label>`    | Defines a label for form elements. |
| `<textarea>` | Defines a multi-line text input. |
| `<select>`   | Defines a dropdown menu. |
| `<option>`   | Defines an option inside a `<select>` element. |
| `<meta>`     | Defines metadata (character set, viewport, etc.) for the document. |
| `<link>`     | Links external resources like CSS files. |
| `<script>`   | Embeds JavaScript into HTML. |
| `<style>`    | Defines CSS styles inside an HTML document. |
| `<iframe>`   | Embeds another webpage inside an HTML document. |
| `<br>`       | Creates a line break. |
| `<hr>`       | Creates a horizontal line. |



# BeautifulSoup: Web Scraping in Python 🥣🐍

## What is BeautifulSoup?

**BeautifulSoup** is a Python library used for **parsing HTML and XML documents**. It helps extract specific data from web pages by converting the HTML structure into a tree that is easy to navigate and manipulate.

It is mainly used for **web scraping**, where data is extracted from websites automatically.
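
As a minimal sketch of what "converting HTML into a tree" looks like in practice, the snippet below parses a small made-up HTML string (not taken from any real page) and reads a few values from it. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration
html = """
<html>
  <body>
    <h1>Course Catalog</h1>
    <p class="intro">Welcome to the <b>catalog</b>.</p>
    <a href="/courses">All courses</a>
  </body>
</html>
"""

# Parse the string into a navigable tree using Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

heading = soup.h1.text        # text inside the first <h1>
intro_class = soup.p["class"]  # class is multi-valued, so this is a list
link_target = soup.a["href"]   # value of the href attribute on the first <a>

print(heading)
print(intro_class)
print(link_target)
```

Accessing `soup.h1` or `soup.a` is shorthand for `soup.find('h1')` and `soup.find('a')`: each returns the first matching tag.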

---

## Why Use BeautifulSoup?

✅ **Easy to use** – Simple syntax to navigate and extract elements.  
✅ **Supports multiple parsers** – Works with `html.parser`, `lxml`, and `html5lib`.  
✅ **Handles messy HTML** – Can parse broken or incorrectly formatted HTML.  
✅ **Works well with Requests** – Requests downloads the page and BeautifulSoup parses it.
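
To illustrate the "handles messy HTML" point, here is a small sketch that feeds deliberately broken markup to the built-in `html.parser`. The markup is invented for this example, and other parsers (`lxml`, `html5lib`) may repair the same input somewhat differently.

```python
from bs4 import BeautifulSoup

# Deliberately broken, made-up HTML: <b> and <p> are never closed
broken_html = "<div><p>Unclosed paragraph <b>bold text</div>"

soup = BeautifulSoup(broken_html, "html.parser")

# The tags can still be found and read even though they were never closed
bold_text = soup.find("b").get_text()
paragraph_text = soup.find("p").get_text()

print(bold_text)
print(paragraph_text)
print(soup.prettify())  # shows the repaired tree
```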

---

## Installing BeautifulSoup

To install BeautifulSoup and its dependencies, run:

```bash
pip install beautifulsoup4 requests
```

In [None]:
from bs4 import BeautifulSoup
import requests

### Step 2: Setting Up BeautifulSoup
To extract data from HTML, install BeautifulSoup, plus `requests` if you are fetching pages from the web. Inside a notebook, prefix the command with `!`:

In [None]:
!pip install beautifulsoup4 requests



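Next, fetch the page's HTML. This notebook downloads it with LangChain's `AsyncHtmlLoader`, which needs to be installed first; the HTML is then loaded into BeautifulSoup: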

In [None]:
!pip install -qU langchain_community langchain


In [None]:
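# Document loaders live in langchain_community (installed above)
from langchain_community.document_loaders import AsyncHtmlLoader

# AsyncHtmlLoader takes a list of URLs and returns one Document per URL
urls = ["https://dent.umich.edu/education/internationally-trained-dentist-program-itdp"]
loader = AsyncHtmlLoader(urls)
docs = loader.load()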

Fetching pages: 100%|##########| 1/1 [00:00<00:00,  2.48it/s]


In [None]:
docs[0].page_content



In [None]:
html_content = docs[0].page_content

In [None]:
soup = BeautifulSoup(html_content, 'html.parser')


### Step 3: Finding Tags - The Basics
Tags are elements in HTML, like `<a>`, `<div>`, and `<h1>`. BeautifulSoup allows you to search for these tags and retrieve the data inside them.

In [None]:
a_tags = soup.find_all('a')
for tag in a_tags:
    print(tag.text)  # Extracts the text inside the <a> tag


        Skip to main content
      
         Directions and Parking
           


         Directory
           


         Events
           


         Education
           


         Doctor of Dental Surgery (DDS)
           


         Dental Hygiene Undergraduate Program (BS)
           


         Dental Hygiene Degree Completion
           


         Oral Health Sciences PhD (OHS PhD)
           


         Internationally Trained Dentist Program (ITDP)
           


         Graduate Programs
           


         Master's Programs
           


         Dental Hygiene (MS)
           


         Endodontics Graduate Program (MS)
           


         Orthodontics Graduate Program (MS)
           


         Pediatric Dentistry Graduate Program (MS)
           


         Periodontics Graduate Program (MS)
           


         Prosthodontics Graduate Program (MS)
           


         Restorative Dentistry Graduate Program (MS)
           


         PhD Programs
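
The output above carries a lot of surrounding whitespace because `.text` returns the text exactly as it appears in the markup. One way to tidy it, sketched here on made-up markup that mimics the padded link text, is `get_text(strip=True)` combined with skipping links that have no visible text.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the padded link text seen in the output above
html = """
<nav>
  <a href="/events">
         Events
  </a>
  <a href="/education">
         Education
  </a>
  <a href="/empty"></a>
</nav>
"""
soup = BeautifulSoup(html, "html.parser")

link_texts = []
for tag in soup.find_all("a"):
    text = tag.get_text(strip=True)  # trims whitespace around each piece of text
    if text:  # skip links that have no visible text
        link_texts.append(text)

print(link_texts)
```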
      

### Step 4: Extracting Attributes - Working with Links
HTML tags often have attributes like `href` in `<a>` tags. Let's learn how to extract these attributes.

In [None]:
a_tags = soup.find_all('a')
for tag in a_tags:
    print(tag.get('href'))  # Extracts the href attribute

#page-content
/directions-and-parking
/people
/events
/education
/education/doctor-dental-surgery-dds
/education/dental-hygiene-undergraduate-program-bs
/education/dental-hygiene-degree-completion
/education/oral-health-sciences-phd
/education/internationally-trained-dentist-program-itdp
/education/graduate-programs
/education/masters-programs
/education/dental-hygiene-graduate-program-ms
/education/endodontics-graduate-program-ms
/education/orthodontics-graduate-program-ms
/education/pediatric-dentistry-graduate-program-ms
/education/periodontics-graduate-program-ms
/education/prosthodontics-graduate-program-ms
/education/graduate-restorative-dentistry-program-ms
/education/phd-programs
/education/msphd
/education/oral-health-sciences-phd-ohs-phd
/education/non-degree-programs
/education/craniofacial-orthodontics-fellowship-graduate-program-non-degree
/education/endodontics-graduate-program-non-degree
/education/international-team-implantology-iti
/education/orthodontics-graduate-prog
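
Most of the `href` values above are relative paths, and the first one is an in-page anchor. If absolute URLs are needed, one possible approach is to resolve each value against the page URL with the standard library's `urljoin`. The sketch below uses a short made-up HTML sample, and treating the page URL as the base is an assumption.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base: the URL of the page that was scraped
base_url = "https://dent.umich.edu/education/internationally-trained-dentist-program-itdp"

# Made-up sample containing the kinds of links seen above
html = """
<a href="#page-content">Skip to main content</a>
<a href="/events">Events</a>
<a href="/education/phd-programs">PhD Programs</a>
<a>No href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

absolute_links = []
for tag in soup.find_all("a", href=True):  # href=True skips <a> tags without an href
    href = tag["href"]
    if href.startswith("#"):  # in-page anchor, not a separate page
        continue
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```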

### Step 5: Navigating through the HTML Tree
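HTML documents form a tree of nested tags, and you can move through it using parent, child, and sibling relationships. Navigation starts from a tag you have already located, such as the first `<h4>` on the page:

In [None]:
h4_tag = soup.find('h4')  # find() returns the first match, or None if there is none
print(h4_tag.text)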

 National Board Dental Examination
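
Once a tag has been located, the parent, child, and sibling relationships can be followed from it. The sketch below uses a made-up fragment (the `id` value and the paragraph text are invented) rather than the real page, whose structure around the `<h4>` is not shown here.

```python
from bs4 import BeautifulSoup

# Made-up fragment; only the heading text is borrowed from the output above
html = """
<div id="requirements">
  <h4>National Board Dental Examination</h4>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h4_tag = soup.find("h4")

# Parent: the tag that directly contains the <h4>
parent_id = h4_tag.parent["id"]

# Children: direct child tags of the parent (recursive=False stops at one level)
child_names = [child.name for child in h4_tag.parent.find_all(True, recursive=False)]

# Siblings: tags at the same level that come after the <h4>
next_p = h4_tag.find_next_sibling("p")
following_texts = [p.text for p in h4_tag.find_next_siblings("p")]

print(parent_id)
print(child_names)
print(next_p.text)
print(following_texts)
```

`find_next_sibling` is used instead of `.next_sibling` because the latter often returns the whitespace between tags rather than the next tag.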


### Step 6: Working with Lists - `<li>` Tags
The page contains `<li>` tags, each defining one list item. `find_all('li')` returns every one of them, wherever it sits in the document:

In [None]:
li_tags = soup.find_all('li')
for tag in li_tags:
    print(tag.text)  # Extracts the text inside each <li> tag


         Directions and Parking
           




         Directory
           




         Events
           




         Education
           




         Doctor of Dental Surgery (DDS)
           




         Dental Hygiene Undergraduate Program (BS)
           




         Dental Hygiene Degree Completion
           




         Oral Health Sciences PhD (OHS PhD)
           




         Internationally Trained Dentist Program (ITDP)
           




         Graduate Programs
           




         Master's Programs
           




         Dental Hygiene (MS)
           




         Endodontics Graduate Program (MS)
           




         Orthodontics Graduate Program (MS)
           




         Pediatric Dentistry Graduate Program (MS)
           




         Periodontics Graduate Program (MS)
           




         Prosthodontics Graduate Program (MS)
           




         Restorative Dentistry Graduate Program (MS)
           






         PhD Programs
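
When lists are nested, the text of an outer `<li>` also includes the text of every item nested inside it, which can make the output of `find_all('li')` repetitive. A possible way to separate the levels is sketched below on a made-up menu (the `id="menu"` attribute is hypothetical), using `recursive=False` to stay at one level.

```python
from bs4 import BeautifulSoup

# Made-up nested menu; the id attribute is hypothetical
html = """
<ul id="menu">
  <li><a href="/education">Education</a>
    <ul>
      <li><a href="/education/doctor-dental-surgery-dds">Doctor of Dental Surgery (DDS)</a></li>
      <li><a href="/education/phd-programs">PhD Programs</a></li>
    </ul>
  </li>
  <li><a href="/events">Events</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
menu = soup.find("ul", id="menu")

# Top-level items only: look at direct <li> children of the menu
top_level = []
for li in menu.find_all("li", recursive=False):
    link = li.find("a")  # the first <a> in document order is the item's own link
    top_level.append(link.get_text(strip=True))

# Every item at any depth, one entry per link (CSS child selector)
all_items = [a.get_text(strip=True) for a in menu.select("li > a")]

print(top_level)
print(all_items)
```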
    

### Step 8: Extracting Data from Nested Tags
HTML often contains nested tags. Calling `find` on a tag searches only within that tag's descendants, so lookups can be chained to reach nested structures:

In [None]:
section_tag = soup.find('section')
nested_div = section_tag.find('div')
print(nested_div.text)
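
The first `<div>` inside a `<section>` is not always the one that holds the content, and `find` returns `None` when nothing matches. The sketch below shows one defensive variant on made-up markup: it guards against `None` and uses a CSS selector to target a specific nested element. The class names `spacer` and `content` are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names are hypothetical
html = """
<section>
  <div class="spacer"></div>
  <div class="content">
    <h3>REQUIREMENTS</h3>
    <p>Applicants are reviewed in a <em>holistic</em> manner.</p>
  </div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# Chained find() calls, guarded so a missing tag does not raise AttributeError
section_tag = soup.find("section")
first_div = section_tag.find("div") if section_tag else None
first_div_text = first_div.get_text(strip=True) if first_div else ""

# A CSS selector reaches the nested paragraph in a single step
content_p = soup.select_one("section div.content p")
content_text = content_p.get_text(" ", strip=True) if content_p else ""

print(repr(first_div_text))
print(content_text)
```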
















### Step 9: Extracting All Information from the Page
To extract all the useful data from the page (links, headings, paragraphs), combine the techniques from the previous steps. In the cell below only the paragraph extraction is active; headings are extracted in the following cell.

In [None]:
# Extract all links (commented out; uncomment to run)
# links = soup.find_all('a')
# for link in links:
#     print(f"Link text: {link.text}, URL: {link.get('href')}")

# Extract all headers (commented out; uncomment to run)
# headers = soup.find_all(['h1', 'h2', 'h3'])
# for header in headers:
#     print(f"Header: {header.text}")

# Extract all paragraphs
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(f"Paragraph: {paragraph.text}")

Paragraph: testing tabs feature of menu system
Paragraph: The Internationally Trained Dentist Program (ITDP) requires a full-time,
28-month commitment for graduates of non-US dental schools, in order to earn a Doctor of Dental
Surgery (DDS) degree.
Paragraph: Applicants to the Internationally Trained Dentist Program (ITDP) are reviewed in a holistic
		manner with consideration to all components of the application, including the criteria below. The following items are only accepted through the CAAPID application:
Paragraph: All applicants are required to report passing National Board results on the CAAPID application. Either the NBDE Part I or the INBDE is accepted. 
Paragraph: The Advanced Dental Admissions Test (ADAT) is recommended but not required
			for admission to the University of Michigan Internationally Trained Dentist
			Program.
Paragraph: Applicants who have taken the ADAT and submit scores via ADEA CAAPID will receive priority
			review. However, all applicants who meet ou

In [None]:
headers = soup.find_all(['h1', 'h2', 'h3'])
for header in headers:
    print(f"Header: {header.text}")

Header: Main navigation
Header: Internationally Trained Dentist Program (ITDP)
Header: Type your search terms and click Enter/Return to see full results
Header: APPLICANTS
Header: REQUIREMENTS
Header: LICENSURE DISCLOSURE
Header: HOW TO APPLY
Header: PROGRAM COSTS
Header: INTERVIEWS
Header: INTERVIEW INVITATIONS
Header: INTERVIEW ASSESSMENT
Header: PLANNING YOUR VISIT
Header: ADMITTED STUDENTS
Header: ACCEPTING THE OFFER OF ADMISSION
Header: MATRICULATION DOCUMENTS
Header: OFFICIAL DOCUMENTS
Header: HEALTH INSURANCE
Header: ADDITIONAL INFORMATION
Header: SOCIAL MEDIA
Header: SYNERGY DENTIST


In [None]:
headers

[<h2 class="sr-only" id="block-mainnavigation-2-menu">Main navigation</h2>,
 <h1 class="header-title overlay-none">Internationally Trained Dentist Program (ITDP)</h1>,
 <h2 class="hidden" style="font-size: 1.04em;text-shadow: 0 0 black;">Type your search terms and click Enter/Return to see full results</h2>,
 <h2 class="text-center bg-um-gray my-2">APPLICANTS</h2>,
 <h3>REQUIREMENTS</h3>,
 <h3>LICENSURE DISCLOSURE</h3>,
 <h3>HOW TO APPLY</h3>,
 <h3>PROGRAM COSTS</h3>,
 <h2 class="text-center bg-um-gray my-2">INTERVIEWS</h2>,
 <h3>INTERVIEW INVITATIONS</h3>,
 <h3>INTERVIEW ASSESSMENT</h3>,
 <h3>PLANNING YOUR VISIT</h3>,
 <h2 class="text-center bg-um-gray my-2">ADMITTED STUDENTS</h2>,
 <h3>ACCEPTING THE OFFER OF ADMISSION</h3>,
 <h3>MATRICULATION DOCUMENTS</h3>,
 <h3>OFFICIAL DOCUMENTS</h3>,
 <h3>HEALTH INSURANCE</h3>,
 <h3>ADDITIONAL INFORMATION</h3>,
 <h3>SOCIAL MEDIA</h3>,
 <h3>SYNERGY DENTIST</h3>]
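
The separate loops above can also be gathered into one helper that returns structured data instead of printing. The following is a sketch under assumptions: `extract_page_summary` is a hypothetical name, the dictionary layout is one choice among many, and the sample HTML is made up.

```python
from bs4 import BeautifulSoup


def extract_page_summary(html):
    """Collect links, headings, and non-empty paragraphs from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")

    links = [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]
    headings = [
        {"level": h.name, "text": h.get_text(strip=True)}
        for h in soup.find_all(["h1", "h2", "h3"])
    ]
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    paragraphs = [text for text in paragraphs if text]  # drop empty paragraphs

    return {"links": links, "headings": headings, "paragraphs": paragraphs}


# Made-up sample input
sample_html = """
<h1>Internationally Trained Dentist Program (ITDP)</h1>
<h2>APPLICANTS</h2>
<p>The program requires a full-time commitment.</p>
<p>   </p>
<a href="/events">Events</a>
"""
summary = extract_page_summary(sample_html)
print(summary)
```

In the notebook, the same helper could be called as `extract_page_summary(html_content)`.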

### Step 10: Handling Complex HTML Structures
Some content lives in more complex structures such as tables or forms. The same `find` and `find_all` calls apply: locate the container first, then walk its rows and cells.

In [None]:
table = soup.find('table')
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col.text)

1.

Create a CAAPID account and fill out the application (see the ADEA CAAPID
							Application Instructions page).

2.

Submit the following directly to CAAPID:



Educational Credential Evaluators (ECE) report

											Applicants must submit an official Educational Credential Evaluators (ECE)
											course-by-course evaluation of all dental coursework completed outside of the US and
											Canada. ECE reports should be sent directly to CAAPID. See instructions for sending
											foreign transcript evaluations on CAAPID’s Academic History
											page.
											Please note: The University of Michigan does not accept World
											Education Services (WES) evaluations.



Letters of recommendation

											Applicants must select three evaluators to submit letters of recommendation directly
											through CAAPID (see detailed
											instructions). See our specific Letters of Recommendation
											requirements.
										


Standardized tests
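
Printing each cell on its own line loses the row structure. As a sketch, the helper below keeps one list per row, includes `<th>` header cells, and tolerates a page without a table. The name `table_to_rows` is hypothetical, and the sample table with its header row is made up.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return one list of cell strings per <tr>, covering both <th> and <td>."""
    return [
        [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


# Made-up table; the header row is an assumption for this example
html = """
<table>
  <tr><th>Step</th><th>Action</th></tr>
  <tr><td>1.</td><td>Create a CAAPID account</td></tr>
  <tr><td>2.</td><td>Submit documents to CAAPID</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")
rows = table_to_rows(table) if table is not None else []

# Pair each data row with the header row to get one dict per row
records = [dict(zip(rows[0], row)) for row in rows[1:]]

print(rows)
print(records)
```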

						