### 1. Markup Languages: The Structure of HTML Code

**Markdown:**
- HTML structure and its importance in web scraping.
- Introduce basic HTML tags, elements, and attributes (e.g., `<div>`, `<p>`, `class`, and `id`).

**Code:**
```html
<!-- HTML example for demonstration -->
<!DOCTYPE html>
<html>
<head>
  <title>Sample Webpage</title>
</head>
<body>
  <h1>Welcome to the Sample Page</h1>
  <p>This is a paragraph with some <b>bold text</b>.</p>
  <div class="content" id="main">
    <p>Here is some content inside a div.</p>
  </div>
</body>
</html>
```

**Activity:**  
- Identify HTML tags, attributes, and text content.
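
One way to check answers to this activity is to let Python list what it sees in the markup. The sketch below uses the standard library's `html.parser.HTMLParser` rather than BeautifulSoup, so it needs no installation. The `TagLister` class and the shortened `sample` string are illustrative names, not part of the lesson's code.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects tags, attributes, and text while the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.tags = []        # tag names in the order they open
        self.attributes = []  # (tag, attribute, value) triples
        self.texts = []       # non-empty pieces of text content

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            self.attributes.append((tag, name, value))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


# Shortened copy of the sample page body
sample = """
<body>
  <h1>Welcome to the Sample Page</h1>
  <p>This is a paragraph with some <b>bold text</b>.</p>
  <div class="content" id="main">
    <p>Here is some content inside a div.</p>
  </div>
</body>
"""

lister = TagLister()
lister.feed(sample)
lister.close()

print("Tags:", lister.tags)
print("Attributes:", lister.attributes)
print("Text:", lister.texts)
```

The three lists correspond to the three things the activity asks for: element names, attribute name/value pairs, and the text between tags.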

### 2. Understanding HTML with BeautifulSoup

**Markdown:**
- BeautifulSoup library and how it’s used to parse HTML.
- How to install BeautifulSoup with `pip install beautifulsoup4`.

**Code:**
```python
# Import BeautifulSoup and load a sample HTML document
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
```

```
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
 </body>
</html>
```

**Activity:**
- Print specific elements (e.g., the title, first link, all paragraphs).

```python
from bs4 import BeautifulSoup

# Sample HTML Document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

# Parse the HTML document
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the parsed HTML in a readable format
print("Prettified HTML:")
print(soup.prettify())

# Extract and print the title
title = soup.title
print("\nTitle of the page:")
print(title.string)

# Extract and print the first link
first_link = soup.find('a')  # Find the first <a> tag
print("\nFirst link:")
print(f"Link text: {first_link.string}")
print(f"Link URL: {first_link['href']}")

# Extract and print all paragraphs
paragraphs = soup.find_all('p')
print("\nAll paragraphs:")
for i, paragraph in enumerate(paragraphs, 1):
    print(f"Paragraph {i}: {paragraph.get_text()}")
```

```
Prettified HTML:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
 </body>
</html>


Title of the page:
The Dormouse's story

First link:
Link text: Elsie
Link URL: http://example.com/elsie

All paragraphs:
Paragraph 1: The Dormouse's story
Paragraph 2: Once upon a time there were three little sisters; their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
```


### 3. More Complex HTML Parsing

**Markdown:**
- Handling nested elements and complex HTML structures.
- Methods like `find()`, `find_all()`, `select()`, and CSS selectors.
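
As a small illustration of these methods on nested markup, the sketch below assumes a hypothetical `nested_html` snippet (it is not the lesson's sample document) and a separate `nested_soup` variable so the `soup` object from the previous section is left alone.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup, used only for this illustration
nested_html = """
<div id="main">
  <ul class="menu">
    <li class="item"><a href="/home">Home</a></li>
    <li class="item active"><a href="/docs">Docs</a></li>
  </ul>
  <div class="footer"><a href="/about">About</a></div>
</div>
"""

nested_soup = BeautifulSoup(nested_html, "html.parser")

# find() returns the first match; calls can be chained to walk down the tree
menu = nested_soup.find("ul", class_="menu")
first_item_link = menu.find("li").find("a")

# find_all() called on an element only searches that element's descendants
menu_links = [a["href"] for a in menu.find_all("a")]

# select() takes a CSS selector; '>' means "direct child"
active_links = [a.get_text() for a in nested_soup.select("ul.menu > li.active > a")]

# select_one() returns the first match, or None when nothing matches
footer_link = nested_soup.select_one("#main div.footer a")

print(first_item_link["href"])
print(menu_links)
print(active_links)
print(footer_link["href"], footer_link.parent["class"])
```

Scoping a search to `menu` keeps the footer link out of `menu_links`, which is usually simpler than filtering a document-wide `find_all('a')` afterwards.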

**Code:**
```python
# Finding specific elements
print("Title tag:", soup.title)
print("First paragraph:", soup.find('p'))
print("All links:", soup.find_all('a'))

# Using CSS Selectors
print("Using CSS Selectors:", soup.select('p.story a'))
```

```
Title tag: <title>The Dormouse's story</title>
First paragraph: <p class="title"><b>The Dormouse's story</b></p>
All links: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Using CSS Selectors: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

**Activity:**
- Task students to retrieve all `<a>` tags and print their `href` attributes.

```python
from bs4 import BeautifulSoup

# Sample HTML Document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

# Parse the HTML document
soup = BeautifulSoup(html_doc, 'html.parser')

# Find all <a> tags
all_links = soup.find_all('a')

# Print the href attribute of each <a> tag
print("All href attributes of <a> tags:")
for link in all_links:
    print(link['href'])
```

```
All href attributes of <a> tags:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
```


### 4. Structuring the Parsing Program Better

**Markdown:**
- How to create functions to handle repetitive tasks.
- Importance of modular code.
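
To make the point about repetition concrete, here is a sketch in which the pattern "select elements, then pull something out of each" is written once and reused. `extract_texts`, `extract_attribute`, and `page_html` are hypothetical names chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this illustration
page_html = """
<h1>News</h1>
<p class="summary">First summary.</p>
<p class="summary">Second summary.</p>
<a href="/one">One</a> <a>No link here</a>
"""


def extract_texts(soup, selector):
    """Return the stripped text of every element matching a CSS selector."""
    return [element.get_text(strip=True) for element in soup.select(selector)]


def extract_attribute(soup, selector, attribute):
    """Return one attribute from each matching element, skipping elements that lack it."""
    return [
        element[attribute]
        for element in soup.select(selector)
        if element.has_attr(attribute)
    ]


page_soup = BeautifulSoup(page_html, "html.parser")

# The same helpers serve every extraction, so no loop is written twice
print(extract_texts(page_soup, "h1"))
print(extract_texts(page_soup, "p.summary"))
print(extract_attribute(page_soup, "a", "href"))
```

Because each helper takes the parsed document as an argument, it can be tested on a small string like `page_html` before being pointed at a real page.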

**Code:**
```python
# Function to retrieve all links from an HTML document
def get_all_links(soup):
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    return links

# Testing the function
print("Extracted Links:", get_all_links(soup))
```

```
Extracted Links: ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
```


### 5. Splitting HTML Locators Out of the Python Class

**Markdown:**
- How to create locator constants for element IDs, classes, etc., to improve code readability.
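
The section title mentions keeping locators out of the class that does the parsing. A minimal sketch of that split might look like the following, where `StoryLocators`, `StoryParser`, and `story_html` are hypothetical names introduced only for this example.

```python
from bs4 import BeautifulSoup


class StoryLocators:
    """Hypothetical container: every CSS selector the parser needs lives here."""
    TITLE = "p.title b"
    SISTER_LINKS = "a.sister"


class StoryParser:
    """Hypothetical parser class that refers to locators by name only."""

    def __init__(self, html):
        self.soup = BeautifulSoup(html, "html.parser")

    @property
    def title(self):
        return self.soup.select_one(StoryLocators.TITLE).get_text()

    @property
    def sister_names(self):
        return [a.get_text() for a in self.soup.select(StoryLocators.SISTER_LINKS)]


# Shortened version of the sample document used in the earlier sections
story_html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

parser = StoryParser(story_html)
print(parser.title)
print(parser.sister_names)
```

If the page's class names change, only `StoryLocators` needs editing; the parsing logic stays untouched.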

**Code:**
```python
# Locator constants
LINK_LOCATOR = 'a.sister'  # CSS selector for links with class 'sister'

# Function to retrieve elements using locators
def get_elements_by_locator(soup, locator):
    return soup.select(locator)

print("Elements using Locator:", get_elements_by_locator(soup, LINK_LOCATOR))
```

```
Elements using Locator: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```


### 6. Understanding HTML with the Browser

**Markdown:**
- How to use browser developer tools (Inspect Element) to identify elements and their locators.
- Example website to inspect: `http://quotes.toscrape.com/`, the practice site scraped in the sections that follow.
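
Once a selector has been read off the developer tools, it can be passed to `select()` or `select_one()`. The sketch below assumes a hand-written `fragment` that mimics one quote block using the class names the scraping sections rely on (`quote`, `text`, `author`). The live page's markup may differ, so confirm each selector in the browser first.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; not fetched from the live site
fragment = """
<div class="quote">
  <span class="text">"An example quote."</span>
  <span>by <small class="author">Example Author</small></span>
</div>
"""

fragment_soup = BeautifulSoup(fragment, "html.parser")

# Selectors as they might be read off the Elements panel (assumed, not verified)
QUOTE_TEXT = "div.quote > span.text"
QUOTE_AUTHOR = "div.quote small.author"

print(fragment_soup.select_one(QUOTE_TEXT).get_text())
print(fragment_soup.select_one(QUOTE_AUTHOR).get_text())
```

Selectors copied straight from the browser are often longer than necessary; trimming them down to the class names that identify the element tends to survive small layout changes better.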

### 7. Scraping the First Website with Python

**Markdown:**
- Introduce the `requests` library for making HTTP requests.
- Response objects and handling HTTP status codes.
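
Status codes fall into classes by their first digit, which is what a check on `response.status_code` relies on. The sketch below uses only the standard library's `http.HTTPStatus` so that it runs without a network connection. `describe_status` is a hypothetical helper, and in a real scraper its argument would come from `response.status_code`.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short description such as '404 Not Found (client error)'."""
    status = HTTPStatus(status_code)  # raises ValueError for unknown codes
    if 200 <= status_code < 300:
        category = "success"
    elif 300 <= status_code < 400:
        category = "redirect"
    elif 400 <= status_code < 500:
        category = "client error"
    elif status_code >= 500:
        category = "server error"
    else:
        category = "informational"
    return f"{status_code} {status.phrase} ({category})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With `requests`, calling `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which is an alternative to comparing `status_code` by hand.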

**Code:**
```python
# Install the requests library if it is not already installed (notebook shell command)
!pip install requests
```

```python
# Simple web scraping example
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com/"
response = requests.get(url)

if response.status_code == 200:
    page_content = response.text
    soup = BeautifulSoup(page_content, 'html.parser')
    print("Page Title:", soup.title.string)
else:
    print("Failed to retrieve page:", response.status_code)
```

```
Page Title: Quotes to Scrape
```

**Activity:**
- Retrieve and print the main title and author names from the page.

```python
import requests
from bs4 import BeautifulSoup

# URL of the page to scrape
url = "http://quotes.toscrape.com/"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.text
    soup = BeautifulSoup(page_content, 'html.parser')

    # Print the page title
    print("Page Title:", soup.title.string)

    # Find and print all author names
    authors = soup.find_all('small', class_='author')
    print("Authors:")
    for author in authors:
        print(author.text)

else:
    print("Failed to retrieve page:", response.status_code)
```

```
Page Title: Quotes to Scrape
Authors:
Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
```
Steve Martin


### 8. Milestone Project 3: A Quote Scraper

**Markdown:**
- Project goal: scrape quotes and their authors from a website.

**Code:**
```python
# Initial setup for Quote Scraper
def get_quotes_from_page(soup):
    quotes = []
    for quote_block in soup.find_all('div', class_='quote'):
        text = quote_block.find('span', class_='text').get_text()
        author = quote_block.find('small', class_='author').get_text()
        quotes.append({'text': text, 'author': author})
    return quotes

# Testing the function with the first page
# (`soup` is the parsed page from the previous section)
quotes = get_quotes_from_page(soup)
print("Quotes:", quotes[:5])  # Display first 5 quotes
```

```
Quotes: [{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}, {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling'}, {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}, {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}, {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}]
```


### 9. Quotes Project 2: Structuring a Scraping App in Python

**Markdown:**
- Demonstrate organizing the scraper into functions.
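
One possible way to organize the scraper is to keep fetching and parsing in separate functions, so the parsing step can be exercised without touching the network. In the sketch below, `parse_quotes`, `scrape`, and `fake_fetch` are hypothetical names, and the HTML returned by `fake_fetch` is hand-written to mimic the quote markup used in the previous section.

```python
from bs4 import BeautifulSoup


def parse_quotes(html):
    """Turn one page of HTML into a list of {'text', 'author'} dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "text": block.find("span", class_="text").get_text(),
            "author": block.find("small", class_="author").get_text(),
        }
        for block in soup.find_all("div", class_="quote")
    ]


def scrape(url, fetch):
    """Fetch a page with the supplied `fetch` callable, then parse it."""
    return parse_quotes(fetch(url))


def fake_fetch(url):
    """Stand-in for a network call so the example runs offline (hypothetical)."""
    return """
    <div class="quote">
      <span class="text">Example quote one.</span>
      <small class="author">Author A</small>
    </div>
    <div class="quote">
      <span class="text">Example quote two.</span>
      <small class="author">Author B</small>
    </div>
    """


offline_quotes = scrape("http://quotes.toscrape.com/", fetch=fake_fetch)
for quote in offline_quotes:
    print(f"{quote['author']}: {quote['text']}")
```

In a live run, `fetch` could be something like `lambda url: requests.get(url).text`; passing it in as an argument is a design choice that keeps `parse_quotes` testable on its own.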

**Code:**
```python
# Full Quote Scraper function
def scrape_quotes(url):
    response = requests.get(url)
    response.raise_for_status()  # Fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    return get_quotes_from_page(soup)

# Scraping quotes from a website
url = "http://quotes.toscrape.com/"
print("Quotes from page:", scrape_quotes(url))
```

```
Quotes from page: [{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}, {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling'}, {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}, {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}, {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}, {'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein'}, {'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide'}
```

### 10-13. Final Project Steps: Parsing, Getting Locators, and Recap

**Markdown:**
- Detailed steps to complete the scraper by adding locators, handling pagination, and crafting a parser class for better data handling.
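
The three ideas named above (locators, pagination, and a parser class) can be combined in one small sketch. Everything here is an assumption made for illustration: the names `QuotePageLocators` and `QuotePage`, the `li.next > a` selector for the "Next" link, and the hand-written `example_html`. Check the real page in the browser before relying on any of the selectors.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


class QuotePageLocators:
    """Hypothetical locator constants; confirm each selector in the browser."""
    QUOTE = "div.quote"
    TEXT = "span.text"
    AUTHOR = "small.author"
    NEXT_LINK = "li.next > a"  # assumed markup for the 'Next' button


class QuotePage:
    """Hypothetical parser class wrapping one page of quotes."""

    def __init__(self, html, page_url):
        self.soup = BeautifulSoup(html, "html.parser")
        self.page_url = page_url

    @property
    def quotes(self):
        return [
            {
                "text": block.select_one(QuotePageLocators.TEXT).get_text(),
                "author": block.select_one(QuotePageLocators.AUTHOR).get_text(),
            }
            for block in self.soup.select(QuotePageLocators.QUOTE)
        ]

    @property
    def next_url(self):
        """Absolute URL of the next page, or None when there is no 'Next' link."""
        link = self.soup.select_one(QuotePageLocators.NEXT_LINK)
        return urljoin(self.page_url, link["href"]) if link else None


# Hand-written page for illustration; not fetched from the live site
example_html = """
<div class="quote">
  <span class="text">Example quote.</span>
  <small class="author">Example Author</small>
</div>
<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>
"""

page = QuotePage(example_html, "http://quotes.toscrape.com/")
print(page.quotes)
print(page.next_url)
```

Following `next_url` until it is `None` is one way to paginate without knowing the number of pages in advance, in contrast to looping over a fixed page range.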

**Code:**
```python
# Parser class for the Quote Scraper
class QuoteParser:
    def __init__(self, html):
        self.soup = BeautifulSoup(html, 'html.parser')

    def get_quotes(self):
        return get_quotes_from_page(self.soup)
```

```python
# Scraping multiple pages
def scrape_multiple_pages(base_url, pages=5):
    all_quotes = []
    for page in range(1, pages+1):
        url = f"{base_url}page/{page}/"
        all_quotes.extend(scrape_quotes(url))
    return all_quotes
```

```python
# Running the scraper
base_url = "http://quotes.toscrape.com/"
quotes = scrape_multiple_pages(base_url, 2)
print("All Quotes from multiple pages:", quotes)
```

```
All Quotes from multiple pages: [{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}, {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling'}, {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}, {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}, {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}, {'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein'}, {'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author':
```
