**Table of contents**<a id='toc0_'></a>    
- [Web Communication Fundamentals](#toc1_1_)    
    - [DNS, IP](#toc1_1_1_)    
    - [HTTP](#toc1_1_2_)    
    - [URL](#toc1_1_3_)    
  - [HTTP Requests and Responses](#toc1_2_)    
    - [Requests](#toc1_2_1_)    
    - [Response](#toc1_2_2_)    
- [What is Web Scraping](#toc2_)    
- [Web structure](#toc3_)    
  - [HTML](#toc3_1_)    
    - [Exploring Web Page Structures](#toc3_1_1_)    
    - [Fact 1: HTML is Built on Tags](#toc3_1_2_)    
    - [Fact 2: Tags Can Have Attributes](#toc3_1_3_)    
    - [Fact 3: Tags Can Be Nested](#toc3_1_4_)    
    - [Selecting Specific Elements in Web Scraping](#toc3_1_5_)    
- [Web Scraping with Python](#toc4_)    
    - [Requests: Fetching a Web Page](#toc4_1_1_)    
    - [Parsing HTML with Beautiful Soup](#toc4_1_2_)    
      - [Extracting Data](#toc4_1_2_1_)    
        - [**By Tag**](#toc4_1_2_1_1_)    
        - [**By Class**](#toc4_1_2_1_2_)    
        - [**By Tag and Class**](#toc4_1_2_1_3_)    
        - [**Getting other attributes**](#toc4_1_2_1_4_)    
      - [More filtering options](#toc4_1_2_2_)    
        - [Filtering by Multiple Tags](#toc4_1_2_2_1_)    
        - [Filtering by Multiple Classes](#toc4_1_2_2_2_)    
        - [Combining Multiple Criteria](#toc4_1_2_2_3_)    
        - [Limiting the Results](#toc4_1_2_2_4_)    
        - [Navigating through the "Tree" of HTML Elements](#toc4_1_2_2_5_)    
      - [Creating a DataFrame with the data](#toc4_1_2_3_)    
      - [💡 Check for understanding](#toc4_1_2_4_)    
      - [Scraping many pages](#toc4_1_2_5_)    
      - [CSS selectors](#toc4_1_2_6_)    
    - [More examples (self-guided)](#toc4_1_3_)    
      - [BBC](#toc4_1_3_1_)    
  - [Comments](#toc4_2_)    
  - [Summary](#toc4_3_)    
  - [Further materials](#toc4_4_)    
    - [How to Solve a 403 Error](#toc4_4_1_)    
    - [How to show an image from a URL](#toc4_4_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Web Communication Fundamentals](#toc0_)

![enjuto](https://www.publico.es/files/article_main/uploads//2014/12/13/548bb18f1d9ef.jpg)

### <a id='toc1_1_1_'></a>[DNS, IP](#toc0_)

[How do we connect to www.google.com?](https://www.youtube.com/watch?v=sUhEqT_HSBI&ab_channel=ProfeSang)
* **DNS (Domain Name System)**: essentially the phonebook of the internet. Humans access information online through domain names, like "google.com", while computers locate each other through Internet Protocol (IP) addresses. DNS translates between the two: in this example, it maps the domain name www.google.com to a server IP address such as 216.58.222.196.
* **IP address**: a numerical identifier for a machine on the network, which allows information to be sent to and received by the correct parties.
* **Domain providers (registrars)**: companies that sell and register internet domain names.

### <a id='toc1_1_2_'></a>[HTTP](#toc0_)

**H**yper **T**ext **T**ransfer **P**rotocol
- HTTP is a communications protocol that provides a structure for requests between the client and the server on a network.
- For example, the web browser on the user's computer (the client) uses the HTTP protocol to request information from a website on a server.

### <a id='toc1_1_3_'></a>[URL](#toc0_)
Contains information about the resource being requested from the **server**.

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/http.png?raw=true)

Examples:

- https://www.google.com/webhp?authuser=2
- https://www.towardsdatascience.com
- https://www.ironhack.com/
    - Protocol: https (the same protocol as http, but over an encrypted connection)
    - Domain Name --> ironhack
    - TLD (Top-Level Domain) --> .com
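
As a minimal sketch, Python's standard-library `urllib.parse` module can split the first example URL into the parts described above. The variable names are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.google.com/webhp?authuser=2"
parts = urlsplit(url)

scheme = parts.scheme            # protocol
host = parts.hostname            # full host name, including the domain and TLD
path = parts.path                # path of the resource on the server
params = parse_qs(parts.query)   # query-string parameters as a dict of lists

print(scheme, host, path, params)
```

Everything after the `?` is the query string, which is how extra parameters travel inside a URL.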

## <a id='toc1_2_'></a>[HTTP Requests and Responses](#toc0_)

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/request-response.png?raw=true)

**Requests** and **responses** are fundamental components of the client-server communication model.

### <a id='toc1_2_1_'></a>[Requests](#toc0_)

These are queries or calls **sent by the client** (such as a web browser or other software) **to the server** in order to receive information (a **response**). 

A request typically consists of:
- A method (such as GET, POST, PUT, DELETE) that defines the action to be performed
     * GET: read the information of the resource, without modifying it in any way. Accessing the website from the browser **gets** information.
- The URL or endpoint specifying the resource
- Optional additional information such as:
    - Headers (metadata): User-Agent, Accept-Language
    - Parameters
    - Body content. 
    
For example, a client might send a GET request to retrieve information from a web page or a POST request to submit form data.
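
The parts of a request can be inspected without sending anything over the network. The sketch below uses the standard-library `urllib.request.Request` class (not the `requests` library used later in this lesson) purely to show the method, URL, and headers side by side; the header values are made up for illustration.

```python
from urllib.request import Request

# Build (but do not send) a GET request; the header values are hypothetical
req = Request(
    "https://www.ironhack.com/",
    headers={"User-Agent": "lesson-example/0.1", "Accept-Language": "en"},
    method="GET",
)

print(req.get_method())    # the HTTP method
print(req.full_url)        # the URL being requested
print(req.header_items())  # the headers attached to the request
```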

### <a id='toc1_2_2_'></a>[Response](#toc0_)

These are the answers or data sent **by the server back to the client in reply to a request**. 

A response typically includes:
- A status code that indicates the success or failure of the request
- Headers with meta-information about the server's behavior
- The actual content or data (if applicable), such as HTML, JSON (similar to a Python dictionary), images, or other media types.

An important part of the **response** is the **status code**, a three-digit number that indicates the outcome of the request. Different codes signal whether the server fulfilled the request or why it could not. Status codes are grouped by their first digit:

- **2xx successful**: the request was successfully received, understood, and accepted
- **3xx redirection**: more actions are required to complete the request
- **4xx client error**: the request contains incorrect syntax or cannot be fulfilled
- **5xx server error**: the server has failed to complete an apparently valid request
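
A small sketch of how the first digit determines the group. The helper `status_group` is a hypothetical name introduced here; `HTTPStatus` comes from Python's standard library. The 1xx (informational) family is included for completeness.

```python
from http import HTTPStatus

def status_group(code: int) -> str:
    """Return the family a status code belongs to, based on its first digit."""
    groups = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return groups.get(code // 100, "unknown")

for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard short description of the code
    print(code, HTTPStatus(code).phrase, "->", status_group(code))
```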

Complete list:
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Much more fun:
https://http.cat/

# <a id='toc2_'></a>[What is Web Scraping](#toc0_)

Web scraping is a method employed by data analysts and developers to retrieve information from web pages. It involves fetching a web page and then parsing that page to obtain desired information. This technique is especially useful when the desired data isn't available through APIs. The extracted data can then be cleaned, analyzed, or stored in databases for further data analytics tasks. 

# <a id='toc3_'></a>[Web structure](#toc0_)

The fundamental web technologies that form the structure of the websites we aim to scrape are:

- **HTML**: Standing as the backbone of almost all websites, HTML, the core markup language, is instrumental in creating web pages. It houses all the content available on a webpage.
  
- **CSS**: This stylesheet language works alongside HTML, taking charge of the presentation aspect of the webpages. It controls how HTML elements are displayed, setting the stage for a visually pleasing and organized web interface.

- **JavaScript**: Adding a dynamic touch to the websites, JavaScript comes into play to create interactive and animated content. This programming language has the power to alter webpage content even after it has loaded, bringing a vivid and responsive element to web designs.

In this lesson, we will work with the HTML from the websites.

![html-css-javascript](https://imgs.search.brave.com/ru4Gn_cjhe4mwEOcd_Xk_foZHex9FeRpxDgMhQaiJJY/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9odG1s/LWNzcy1qcy5jb20v/aW1hZ2VzL29nLmpw/Zw)

## <a id='toc3_1_'></a>[HTML](#toc0_)

In the realm of web scraping, understanding HTML (Hypertext Markup Language) is crucial.

HTML is the standard markup language used to create web pages. Think of it as the skeleton or blueprint of a website. It structures content on the web, defining elements like paragraphs, headings, links, lists, and images. These elements are represented by "tags", which enclose content to give it meaning and context.

When web scraping, you'll often navigate through this HTML structure to pinpoint and extract the exact data you need. Tools like web browsers' "Inspect" or "View Source" features allow you to see the underlying HTML of a page, which is invaluable when determining how to access specific pieces of content programmatically.

![image.png](https://imgs.search.brave.com/7aw6NAyYkPZ7Y_flOndmq9DcP5hVk0lIgTSYMS0EjaU/rs:fit:860:0:0/g:ce/aHR0cHM6Ly93d3cu/YWxzYWNyZWF0aW9u/cy5jb20veG1lZGlh/L2RvYy9vcmlnaW5h/bC9odG1sNS1iYXNp/Yy5wbmc)

### <a id='toc3_1_1_'></a>[Exploring Web Page Structures](#toc0_)

To inspect the underlying HTML of a web page, right-click anywhere on the page. Choose "View Page Source" in browsers like Chrome or Firefox. For Internet Explorer, it's "View Source," and for Safari, select "Show Page Source." (In Safari, if this option isn't visible, navigate to Safari Preferences, click on the Advanced tab, and enable "Show Develop menu in menu bar.")

To embark on your web scraping journey, you just need to grasp **three foundational aspects** of HTML.


### <a id='toc3_1_2_'></a>[Fact 1: HTML is Built on Tags](#toc0_)

At its core, HTML is composed of content enclosed in `<tags>`, which are delimited by angle brackets. The tags wrap the textual content we aim to scrape and give it structure and meaning, guiding the browser on how to display it. The acronym "HTML" stands for HyperText Markup Language.

HTML follows a tree-like structure, encompassing parent tags, child tags, and sibling tags:
```
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
```

For instance, consider the `<strong>` tag, signaling bold formatting. If "Jan. 21" is encapsulated between an opening `<strong>` tag and its corresponding closing `</strong>` tag, it denotes where the bold styling begins and ends. This pair of tags instructs the browser to render the enclosed text, "Jan. 21", in bold.

Tags come in various types, each suited to encapsulate specific content:
 * **Headings**: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
 * **Phrasing**: `<b>`, `<img>`, `<sub>`...
 * **Embedded Content**: `<audio>`, `<img>`, `<video>`...
 * **Tabulated Data**: `<table>`, `<tr>`, `<tbody>`...
 * **Page Sections**: `<header>`, `<section>`, `<article>`...
 * **Metadata and Scripts**: `<meta>`, `<title>`, `<script>`...


### <a id='toc3_1_3_'></a>[Fact 2: Tags Can Have Attributes](#toc0_)

HTML tags can possess "attributes," which are defined within the opening tag itself. 

Examine the following examples:

- `<span class="short-desc">`: Here, the `<span>` tag has a `class` attribute with the value "short-desc".
- `<div> Zapas Marca Joma X54 </div>`: This tag doesn't contain any attributes.
- `<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>`: The `div` tag here has two attributes - `class` with the value "price-item" and `id` with the value "offer".
- `<div class="text-monospace" id="name_132" href="www.example.com"> Page Content </div>`: This `div` tag encompasses the following attributes:
    + **class**: With the value "text-monospace". Remember, the class isn't unique across the page.
    + **id**: With the value "name_132". IDs are meant to be unique identifiers for tags on the page.
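    + **href**: With the value "www.example.com". The `href` attribute holds a link to another section of the page or to an external website. (In real pages it normally appears on `<a>` tags rather than `<div>` tags; it is used here only to illustrate a tag with several attributes.)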

**Key Notes**:
- The `id` attribute should be unique for a tag; no two tags should share the same `id`.
- The `class` attribute isn't meant to be unique. Instead, it often groups tags exhibiting similar behavior or styles.

For web scraping purposes, **understanding the semantics** behind terms like `<span>`, `class`, or `short-desc` **isn't crucial**. The key takeaway is recognizing that tags can possess attributes and understanding their structural representation. When extracting content, our goal is to pinpoint the right tags within a webpage's HTML.

**Other commonly used attributes in HTML**

Several attributes in HTML provide additional information or modify elements. Some of these frequently used attributes include:

 * **`dir`**: Sets the text direction within an element: left-to-right (`ltr`) or right-to-left (`rtl`).
 * **`lang`**: Designates the language of the content within an element.
 * **`style`**: Applies inline styling to an element (Note: This shouldn't be mixed up with the `<style>` tag).
 * **`title`**: Offers supplementary details about an element, often displayed as a tooltip (Important: This is distinct from the `<title>` tag).

...and many more.




### <a id='toc3_1_4_'></a>[Fact 3: Tags Can Be Nested](#toc0_)

Imagine the following segment of HTML code:

`Hello <strong><em>Ironhack</em> students</strong>`

Here, the phrase **Ironhack students** would be displayed in bold since it resides between the `<strong>` and `</strong>` tags. Additionally, the word ***Ironhack*** would be italicized due to the `<em>` tag, which signifies italic formatting. However, the word "Hello" remains unaffected by any formatting, as it lies outside both the `<strong>` and `<em>` tags. This results in the display:

Hello ***Ironhack* students**

This example illustrates a key principle: **tags influence the text from their opening to their closing points,** even if they are nested within other tags.
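
To make the nesting visible, here is a minimal sketch that uses the standard-library `html.parser` module (not Beautiful Soup, which is introduced later) to record which tags are open around each piece of text. The class name `NestingTracker` is made up for this illustration, and the sketch assumes well-formed HTML.

```python
from html.parser import HTMLParser

class NestingTracker(HTMLParser):
    """Record each piece of text together with the tags that are open around it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.pieces = []     # (text, tuple of enclosing tags)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumes tags close in the reverse order they were opened
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        self.pieces.append((data, tuple(self.open_tags)))

tracker = NestingTracker()
tracker.feed("Hello <strong><em>Ironhack</em> students</strong>")
tracker.close()

for text, tags in tracker.pieces:
    print(repr(text), tags)
```

Each piece of text is reported with every tag that encloses it, so "Ironhack" is affected by both `strong` and `em`, while "Hello " is affected by neither.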

### <a id='toc3_1_5_'></a>[Selecting Specific Elements in Web Scraping](#toc0_)

When diving into web scraping, it's essential to target specific elements efficiently. To home in on the precise content you need, consider filtering tags based on:
 
 * **Tag Name**: The main type of the element (e.g., `<div>`, `<a>`, `<p>`).
 * **Class**: A descriptor that groups multiple elements with similar characteristics.
 * **ID**: A unique identifier assigned to a particular element.
 * **Other Attributes**: Additional properties like `href`, `title`, or `lang` that can further specify the elements of interest.


# <a id='toc4_'></a>[Web Scraping with Python](#toc0_)

In this lesson, we'll use the `requests` library to fetch web pages and `BeautifulSoup` from the `bs4` package to parse these pages and extract information.

Ensure you've installed the required packages:

In [None]:
# You should know by now what to do here ;)
#!pip install requests beautifulsoup4 pandas

### <a id='toc4_1_1_'></a>[Requests: Fetching a Web Page](#toc0_)


First, we use the `requests` library to fetch the content of a webpage.

In [None]:
import requests

# Let's check out this website
url = "https://www.decathlon.com/collections/mountain-bikes"

# Send a GET request and keep the server's reply
response = requests.get(url)
response

The code above retrieves the webpage from the given URL and saves the server's reply in a `response` object. Its `text` attribute holds the HTML as a decoded string and its `content` attribute holds the same data as raw bytes; either way, it is the HTML we see when viewing the page source in a web browser.

In [None]:
# Get headers

In [None]:
# Show the content type

In [None]:
# Show content

When interacting with APIs, we typically receive data in JSON format. However, web scraping provides us with HTML, which can be challenging to navigate. Fortunately, Beautiful Soup simplifies this process, making our work more manageable!

### <a id='toc4_1_2_'></a>[Parsing HTML with Beautiful Soup](#toc0_)

To parse the HTML, we'll employ the renowned Python library, [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). For a deeper understanding of its functionalities, explore the [official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

![what-soup](https://media.giphy.com/media/l0K42QFkQXQUkxZhS/giphy.gif)


In [None]:
from bs4 import BeautifulSoup

# Get a soup

The code above parses the HTML (stored in `response.content`) into a special object called `soup` that the Beautiful Soup library understands. In other words, Beautiful Soup is **reading the HTML and making sense of its structure.**

In [None]:
# What type of soup do we have?

In [None]:
# Can I make my soup pretty?

With the parsed HTML, we can now extract specific elements.

#### <a id='toc4_1_2_1_'></a>[Extracting Data](#toc0_)

`find` and `find_all` (also available under the older name `findAll`) are methods used to search the soup tree for tags that match certain criteria.

1. **find**:
    - Returns only the **first** tag that matches a given set of criteria.
    - Useful when you know there's only one tag of interest or you only want the first occurrence.
    - Example: If you have multiple `<p>` tags on a page and you use `soup.find('p')`, you'll get only the first `<p>` tag.

2. **findAll (or find_all)**:
    - Returns a **list** of tags that match the given criteria.
    - Useful when you want to capture all occurrences of a particular tag or set of tags.
    - Example: Using `soup.find_all('p')` will give you a list containing all `<p>` tags on the page.

Here's a simple illustration:

```html
<html>
    <body>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div>Some div.</div>
    </body>
</html>
```

Using `find('p')` would return the first `<p>` tag (the one containing "First paragraph."), while `find_all('p')` would return a list containing both `<p>` tags.
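
A minimal, self-contained sketch of the difference, using a condensed version of the snippet above and Python's built-in `html.parser` as the parser. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

sample_html = "<p>First paragraph.</p><p>Second paragraph.</p><div>Some div.</div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_p = sample_soup.find("p")      # a single Tag, or None if nothing matches
all_p = sample_soup.find_all("p")    # a list-like collection of Tags

print(first_p.get_text())
print([p.get_text() for p in all_p])
print(sample_soup.find("table"))     # no <table> in the document
```

Because `find` returns `None` when there is no match, it is worth checking the result before calling methods on it.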


Let's look at different ways of extracting data.

In [None]:
html = """
<html>
    <body>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div>Some div.</div>
    </body>
</html>
"""

In [None]:
# Get pretty soup 

In [None]:
# Find paragraph

In [None]:
# Find all paragraphs

##### <a id='toc4_1_2_1_1_'></a>[**By Tag**](#toc0_)

Let's start with a popular tag: `title`.

In [None]:
# Find the first <title> tag on the page

In [None]:
# Find all the <title> tags on the page

##### <a id='toc4_1_2_1_2_'></a>[**By Class**](#toc0_)

To search for HTML elements by class in a webpage using BeautifulSoup, you can also use the `find` and `find_all` methods. 

1. **Using `find` method to get the first matching element**:
   
   ```python
   result = soup.find(class_='your-class-name')
   ```

2. **Using `find_all` method to get a list of all matching elements**:

   ```python
   results = soup.find_all(class_='your-class-name')
   ```
   
Note that we use the `class_` parameter (with a trailing underscore) because `class` is a reserved keyword in Python.

Let's dive into our target URL and explore its structure. Our objective is to craft a dataframe populated with bicycle names and their corresponding prices. 

To pinpoint the exact tags housing this information, follow these steps:
1. Navigate to the website in your browser.
2. Locate a bicycle name, right-click on it, and choose 'Inspect'. This action will direct you to the element within the site's HTML. Identify the tags so we can extract our desired data.
3. Repeat the same process for the price.

Note: the bicycle names and prices will change depending on the newest bikes in the shop.

Let's filter all elements whose `class` is `de-ProductTile-title`.

In this case, all the elements with the `class` `de-ProductTile-title` are `h4` tags, so we got exactly the information we wanted. But what if that class were also used on other tags and we only wanted the `h4` results?

##### <a id='toc4_1_2_1_3_'></a>[**By Tag and Class**](#toc0_)

BeautifulSoup allows filtering results using combinations, such as filtering by tag and class. 

```python
tags = soup.find_all(name=tag_name, class_=class_name)
```

We can use a for loop to iterate over the results and do whatever we need to do.

To extract the names from the provided HTML content, you can:

1. Use the `findAll` method to locate the `<h4>` tags with the specific class (`de-ProductTile-title` in this case).
2. Extract the text from the found tag.
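
The two steps can be sketched on a small, made-up document that only imitates the class name used in this lesson (the product names are placeholders, not real shop data):

```python
from bs4 import BeautifulSoup

# Made-up markup; only the class name is taken from the lesson
sample_html = """
<h4 class="de-ProductTile-title">
    Example Bike A
</h4>
<p class="de-ProductTile-title">Not a heading</p>
<h4 class="de-ProductTile-title">  Example Bike B  </h4>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only <h4> tags with the class are returned; the <p> with the same class is skipped
title_tags = sample_soup.find_all("h4", class_="de-ProductTile-title")

# get_text(strip=True) drops the surrounding whitespace and newlines
names = [tag.get_text(strip=True) for tag in title_tags]
print(names)
```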

In [None]:
# Find all <h4> tags with class "de-ProductTile-title"

In [None]:
# find_all returns a list

In [None]:
# What type are the tags?

In [None]:
# Let's look at how many elements we retrieved

In [None]:
# Let's look at the first element

In [None]:
# We can get the actual text using .text or .get_text()

In [None]:
# Let's get rid of all the whitespace

In [None]:
# Create a new list only with bicycle names

To extract the price from the provided HTML content, you can:

1. Use the `findAll` method to locate the `<span>` tags with the specific class (`js-de-ProductTile-currentPrice` in this case).
2. Extract the text from the found tag.


In [None]:
# Find the <span> tag and get its text

In [None]:
# Show the text for one price tag

In [None]:
# Show the text for all price tags

##### <a id='toc4_1_2_1_4_'></a>[**Getting other attributes**](#toc0_)

To access other attribute values such as hyperlinks (which are usually contained in the `href` attribute of `a` tags), you first locate the element using BeautifulSoup methods such as `find` or `find_all`, and then use the `.get()` method to retrieve the value of the attribute you're interested in. Here is a step-by-step explanation:

1. **Locate the element**: Use `find` or `find_all` to locate the element(s) that contain the attribute you want to access.

    ```python
    link_element = soup.find('a', class_='link-class')
    ```

2. **Access the attribute**: Once you have the element, use the `.get()` method to access the attribute value.

    ```python
    link_url = link_element.get('href')
    ```

In the above snippet:
- We first find the `a` element with the class `'link-class'`.
- We then get the value of the `href` attribute which contains the hyperlink.
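
Here is a runnable sketch of the same idea on made-up markup. The class name `link-class` comes from the snippet above; the paths and the use of `urljoin` to turn a relative link into an absolute one are additions for illustration, with the shop's domain assumed as the base URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

sample_html = """
<a class="link-class" href="/products/example-bike">Example Bike</a>
<a class="link-class">Link without href</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

links = sample_soup.find_all("a", class_="link-class")

# .get() returns None when the attribute is missing (tag["href"] would raise KeyError)
hrefs = [a.get("href") for a in links]

# Relative links need the site's base URL to become usable addresses
base_url = "https://www.decathlon.com"
absolute = [urljoin(base_url, href) for href in hrefs if href]

print(hrefs)
print(absolute)
```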


When inspecting the website, we saw that the bicycle title was a link. How can we get that link?
Let's inspect the whole element containing the bicycle name instead of just the name.

We can see that we have:
    
    <a class="de-u-linkClean js-de-ProductTile-link" href="/collections/mountain-bikes/products/mountain-bike-275-rockrider-st-100-196952-192872">
        
Note: this link will change depending on the newest bike in the shop.

In [None]:
# Find the <a> tag and get its href attribute

In [None]:
# Extracting all links within <a> tags and class 'de-u-linkClean js-de-ProductTile-link'

Finding all the hyperlinks in a page, then following them and extracting new hyperlinks from each page you visit, is called **web crawling**. It is how search engines such as Google discover the pages they index 👀

#### <a id='toc4_1_2_2_'></a>[More filtering options](#toc0_)

##### <a id='toc4_1_2_2_1_'></a>[Filtering by Multiple Tags](#toc0_)

To find elements with multiple possible tags, you can pass a list of tag names to `find_all`.

In [None]:
# Find all <div> and <span> tags

In [None]:
# Find all the classes in a certain tag
classes_ = []

##### <a id='toc4_1_2_2_2_'></a>[Filtering by Multiple Classes](#toc0_)

To find elements with multiple possible classes, you can pass a list of class names.


In [None]:
# Find all elements with class 'js-de-ProductTile-currentPrice' or 'de-ProductTile-title'

##### <a id='toc4_1_2_2_3_'></a>[Combining Multiple Criteria](#toc0_)

You can combine multiple criteria by using the `attrs` argument.

In [None]:
# Find all <div> or <span> tags with class "class1" or "class2"

##### <a id='toc4_1_2_2_4_'></a>[Limiting the Results](#toc0_)

You can limit the number of results returned by `find_all` using the `limit` parameter.
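
The filtering options from this section can be tried out on a tiny, made-up document. The class names `class1` and `class2` are placeholders.

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="class1">div one</div>
<span class="class2">span two</span>
<p class="class1">paragraph</p>
<span class="other">span other</span>
<div class="class2">div two</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A list of tag names matches any of them
divs_and_spans = sample_soup.find_all(["div", "span"])

# A list of classes matches elements that carry at least one of them
class1_or_class2 = sample_soup.find_all(class_=["class1", "class2"])

# Tag names and attribute filters can be combined; attrs takes a dict
combined = sample_soup.find_all(["div", "span"], attrs={"class": ["class1", "class2"]})

# limit stops the search after the given number of matches
first_two = sample_soup.find_all(["div", "span"], limit=2)

print(len(divs_and_spans), len(class1_or_class2), len(combined), len(first_two))
```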

In [None]:
# Only get the first 5 matches

##### <a id='toc4_1_2_2_5_'></a>[Navigating through the "Tree" of HTML Elements](#toc0_)

Beautiful Soup provides a robust set of tools that allow you to traverse and explore the hierarchical structure of an HTML document, often referred to as the "tree". 

To access child elements directly:

In [None]:
# Show a tag from the soup

In [None]:
# Get all h4 elements inside the first div

The code above will first locate the initial `div` element present in the Beautiful Soup object. Subsequently, it will fetch all `h4` elements contained within that `div`.

But what if you need to retrieve a specific child by its position, say the second child?
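
A minimal sketch of these navigation steps on made-up markup (the `id` values and titles are placeholders):

```python
from bs4 import BeautifulSoup

sample_html = """
<div id="first">
    <h4>Title one</h4>
    <h4>Title two</h4>
</div>
<div id="second">
    <h4>Title three</h4>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_div = sample_soup.div                     # attribute access returns the first <div>
h4_in_first_div = first_div.find_all("h4")      # search only inside that <div>

second_h4 = h4_in_first_div[1]                  # the result is list-like, so indexing works
parent_id = second_h4.parent.get("id")          # .parent moves one level up the tree
next_div = first_div.find_next_sibling("div")   # next <div> at the same level

print(second_h4.get_text(), parent_id, next_div.get("id"))
```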

In [None]:
#  Get the second element

#### <a id='toc4_1_2_3_'></a>[Creating a DataFrame with the data](#toc0_)

Instead of getting names and prices separately, we can target the whole component, and extract the name and price from each bicycle component in a more structured manner. By targeting this whole component tag, we can ensure that we are extracting information for the same product (i.e., the name and price correspond to the same bicycle).

Here's how we can achieve this:

1. **Targeting the Whole Component**:
   - Instead of targeting individual tags for names and prices, we target the main component that houses both the name and price.
   - By visually inspecting the HTML, we can see that:
       - The information for each bicycle (name, price, etc.) is grouped together under a `<section>` tag. 
       - The `class` attribute of this `<section>` tag is `de-ProductTile-info`. This class seems specific to the product tile, which makes it a good candidate to use for extraction.
   
2. **Iterating through Components**:
   - For each such component, extract the name and the price.
   
3. **Storing Data**:
   - Store the extracted data in lists, which can then be used to create a DataFrame.

In [None]:
import pandas as pd

# Lists to store extracted data
bicycle_names = []
prices = []

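# Find all components: each <section class="de-ProductTile-info"> holds one bicycle
# (assumes `soup` was created from the page in an earlier cell)
components = soup.find_all('section', class_='de-ProductTile-info')

for component in components:
    # Extract bicycle name
    name_tag = component.find('h4', class_='de-ProductTile-title')
    # Extract price
    price_tag = component.find('span', class_='js-de-ProductTile-currentPrice')

    # Keep the two lists aligned: store None when a tag is missing
    bicycle_names.append(name_tag.get_text(strip=True) if name_tag else None)
    prices.append(price_tag.get_text(strip=True) if price_tag else None)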
# Create DataFrame
df = pd.DataFrame({
    'Bicycle_Name': bicycle_names,
    'Price': prices
})

df

We could clean our dataset further so that the price is stored as a float, which makes numerical operations easy. This means each row should hold a single price rather than a range of prices.
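
One possible way to do this cleaning is sketched below. The helper `parse_price` is a hypothetical name, and the sketch assumes prices look like `$1,299.99`, with a comma as the thousands separator; when a range is given, it arbitrarily keeps the first (lowest) price.

```python
import re

def parse_price(raw_price):
    """Return the first price found in the text as a float, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw_price)
    if match is None:
        return None
    # Remove thousands separators before converting
    return float(match.group().replace(",", ""))

print(parse_price("$1,299.99"))
print(parse_price("  $199.99 - $249.99 "))
print(parse_price("Sold out"))
```

With a DataFrame like the one above, the helper could be applied to the whole column with `df['Price'].apply(parse_price)`.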

#### <a id='toc4_1_2_4_'></a>[💡 Check for understanding](#toc0_)

You are given a raw HTML content of a product list from an online store. Your task is to extract the following details for each product:

- Bicycle Name
- Bicycle Price
- URL for the product details
- URL for the product image

Write a function `extract_bike_info` that takes in the HTML content and returns a pandas DataFrame with the above columns.

<details><summary><b>Hint</b></summary>

To get the product image, it might be a good idea to use the `article` tag with the class `de-ProductTile`: based on the HTML structure, this `article` tag encapsulates the entire product, including both the image and the product details. This allows us to access all the relevant details for each product more easily, without having to jump around different sections.

If we were to only use `soup.find_all('section', class_='de-ProductTile-info')`, we'd be focusing solely on the product details section and would then need a separate approach to extract the image URL. By starting with the `article` tag, we're able to extract all the needed data in a more cohesive and streamlined manner.
</details>

**Bonus:** clean the price column so you can make numerical operations.

In [None]:
# Your code here

In [None]:
# Your function should return a DataFrame, shown here
df = extract_bike_info(soup)
df

#### <a id='toc4_1_2_5_'></a>[Scraping many pages](#toc0_)

When dealing with a limited number of bicycles, all products are conveniently displayed on a single page. But what if there were numerous products necessitating pagination across multiple pages?

Consider the 'deals' collection. By navigating to the end of its first page on the website, we can observe pagination links. Transitioning to the second page results in a change in the URL:

From: 
"https://www.decathlon.com/collections/deals"
To: 
"https://www.decathlon.com/collections/deals?page=2"

This pattern in the URL can be leveraged to generate a series of URLs for web scraping.

Please note: Depending on the current offers available at the time of this lesson, pagination might not be present. If that's the case, explore other product categories that have a substantial number of items, resulting in multiple pages.

In [None]:
pages = [f"https://www.decathlon.com/collections/deals?page={pag}" for pag in range(1, 5)]

In [None]:
pages

Now let's build a DataFrame for each URL using our function from before:

In [None]:
# function get_df_from_url

In [None]:
# Extract all the dfs

In [None]:
# Show the first df

In [None]:
# Concatenate all DataFrames in the list

If you compare our results with the website, you'll see that we are not getting all the products: each page lists more than 9 products, but only 9 are returned per page.

This could be because the content is dynamic. 

**Dynamic Content**: Many modern websites use JavaScript to load content dynamically. When you make a request using libraries like `requests`, you only get the initial HTML content. Any content loaded via JavaScript after the initial page load won't be captured. In such cases, tools like Selenium are used because they drive a real browser, which executes the page's JavaScript before the HTML is read.


#### <a id='toc4_1_2_6_'></a>[CSS selectors](#toc0_)

CSS selectors are patterns used to select and manipulate one or more elements in an HTML or XML document. When web scraping with Python, CSS selectors can be used to target specific elements of interest within the page's content. 

The `select` method in BeautifulSoup allows you to pass a CSS selector and returns a list of elements matching that selector.

1. **Tag Selector**: Targets elements by their tag name.
   - `p`: selects all `<p>` elements.
   - `soup.select("p")`

2. **Class Selector**: Targets elements by their class attribute.
   - `.classname`: selects all elements with `class="classname"`.
   - If an element has several classes (separated by spaces in its `class` attribute), join them with `.` and no spaces: `.class1.class2`
   - `soup.select(".classname")`
   - To combine both, we can have `soup.select("tagname.classname")`

3. **Descendant Selector**: Targets an element that is a descendant of another element.
   - `div p`: selects all `<p>` elements inside a `<div>` element.
   - `.class1 .class2`: selects all elements with class2 that is a descendant of an element with class1.
   
4. **Attribute Selector**: Targets elements based on their attributes and values.
   - `a[href]`: selects all `<a>` elements with an `href` attribute.
   - `a[href="https://www.example.com"]`: selects all `<a>` elements with an `href` value of "https://www.example.com".

And more...
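
The selector types above can be tried on a small, made-up document that imitates the class names used in this lesson (the product name and path are placeholders):

```python
from bs4 import BeautifulSoup

sample_html = """
<article class="de-ProductTile">
    <a class="de-u-linkClean js-de-ProductTile-link" href="/products/example-bike">
        <h4 class="de-ProductTile-title">Example Bike</h4>
    </a>
    <p>Short description</p>
</article>
<p>Outside the article</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_p = sample_soup.select("p")                  # tag selector
tiles = sample_soup.select(".de-ProductTile")    # class selector
# Tag plus two classes: the space-separated classes are joined with dots
links = sample_soup.select("a.de-u-linkClean.js-de-ProductTile-link")
p_in_article = sample_soup.select("article p")   # descendant selector
with_href = sample_soup.select("a[href]")        # attribute selector

# select_one returns only the first match (or None)
first_title = sample_soup.select_one(".de-ProductTile .de-ProductTile-title")

print(len(all_p), len(tiles), len(links), len(p_in_article))
print(with_href[0]["href"], first_title.get_text())
```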


1. **Tag Selector**:
   - **`article`**: This would select all `<article>` elements on the page.
  

In [None]:
# Select all article tags

2. **Class Selector**:
   - **`.de-ProductTile`**: This would select all elements with the class `de-ProductTile`.

In [None]:
# Select all elements with class .de-ProductTile

To combine both, we can have `soup.select("tagname.classname")`

In [None]:
# Show combo

Without CSS selectors we did:

In [None]:
# Extracting all links within <a> tags and class 'de-u-linkClean js-de-ProductTile-link'

Equivalently, using CSS selectors, which are a universal syntax, you can search for `tag_name.class_name`. If the element has several classes (separated by spaces in its `class` attribute), join them with `.`

In [None]:
# Extracting all links within <a> tags and class 'de-u-linkClean js-de-ProductTile-link'

3. **Descendant Selector**:
   - **`.de-ProductTile .de-ProductTile-title`**: This would select all elements with the class `de-ProductTile-title` that are descendants of elements with the class `de-ProductTile`.
   - **`article h4`**: This would select all `<h4>` elements that are descendants of `<article>` elements.

In [None]:
# .de-ProductTile-title descendants of .de-ProductTile class

In [None]:
# h4 descendants of article tags

In [None]:
# how many spans?

In [None]:
# how many spans inside spans?

In [None]:
# how many spans inside spans inside spans?

![](https://media.giphy.com/media/3oEjHE6anD68swMCyI/giphy.gif)

In [None]:
# how many span inside div inside div inside div ...

[Beautiful Soup selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

You can also use a combination of `find`, `find_all`, and `select` methods to navigate and locate the elements you're interested in more efficiently. Here's how you can use them together:

1. **Using `find` or `find_all` to Narrow Down the Search Scope**:
   
   Initially, you can use `find` or `find_all` to narrow down your search to a specific section of the HTML document.

   ```python
   section = soup.find('div', class_='product-section')
   ```

2. **Using `select` to Further Locate Elements**:

   After narrowing down the section, you can use the `select` method to locate elements using CSS selectors, which allow for more complex queries. The `select` method can be used on a BeautifulSoup object or on a Tag object (like the one retrieved in step 1).

   ```python
   product_links = section.select('a.product-link')
   ```

In this snippet:
- First, we locate a section of the webpage using `find`.
- Then, within that section, we locate all `a` elements with the class `'product-link'` using `select`.
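
The two steps can be tried end to end on a small made-up page. The class names `product-section` and `product-link` come from the snippet above; the rest of the HTML is an assumption for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one product section and one unrelated footer
html = """
<div class="product-section">
  <a class="product-link" href="/p/1">First</a>
  <a class="product-link" href="/p/2">Second</a>
</div>
<div class="footer">
  <a class="product-link" href="/p/ad">Sponsored</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: narrow the search to one section (find returns None if nothing matches)
section = soup.find("div", class_="product-section")

# Step 2: run the CSS selector only inside that section
product_links = section.select("a.product-link") if section is not None else []
hrefs = [a["href"] for a in product_links]

# select returns a list, so index into it (or use select_one) before get_text()
first_text = product_links[0].get_text() if product_links else None

print(hrefs)
print(first_text)
```

The link in the footer has the same class but is not returned, because the search started from `section` rather than from `soup`.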

In [None]:
# Find details for the first product

# Extract the main information about the product

In [None]:
# Since it's a list, we need to index into it before we can get the text

### <a id='toc4_1_3_'></a>[More examples (self-guided)](#toc0_)

#### <a id='toc4_1_3_1_'></a>[BBC](#toc0_)

Let's scrape the BBC to gather some information.

We'll get the hyperlinks to images from the BBC website.

In [None]:
response = requests.get("https://www.bbc.com/")
response

In [None]:
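# Name the parser explicitly to avoid a warning and get the same result on every machine
soup = BeautifulSoup(response.content, "html.parser")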

In [None]:
img_tags = soup.find_all("img") # We get all the image elements

In [None]:
len(img_tags) # Let's see how many we got

In [None]:
img_tags[1] # For example, let's look at the second one to see how we can get the actual URL

In [None]:
# We can use the get method to get the src attribute which contains the URL to the image
img_tags[1].get("src")
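
To collect the URL of every image rather than just one, a sketch like the following may help. It uses invented HTML and an assumed base URL (`https://example.com/`) instead of the live BBC page. `.get("src")` returns `None` when the attribute is missing, and `urljoin` turns relative paths into absolute URLs:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical HTML and base URL used only for this illustration
html = """
<img src="https://example.com/a.jpg" alt="A">
<img alt="no source">
<img src="/images/b.png" alt="B">
"""
base_url = "https://example.com/"
soup = BeautifulSoup(html, "html.parser")

# Skip images without a src and resolve relative paths against the base URL
srcs = [
    urljoin(base_url, img.get("src"))
    for img in soup.find_all("img")
    if img.get("src")
]

print(srcs)
```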

If we inspect the top menu of the BBC, we see that we have the `nav` element with the class `orbit-header-links international`. To get the names from the menu, we need to locate all the span elements inside this nav element.

In [None]:
# Find the nav element with the specified class
nav_element = soup.find('nav', class_='orbit-header-links international')

# The BBC's markup changes over time, so find() may return None
if nav_element is None:
    menu_names = []
else:
    # Collect the text of all span elements inside the nav element
    menu_names = [span.get_text() for span in nav_element.find_all('span')]

# Print the menu names
for name in menu_names:
    print(name)

## <a id='toc4_2_'></a>[Comments](#toc0_)

It's always recommended to check whether an **API** is available (covered in the next lesson) before resorting to web scraping, for the following reasons:
 * An API is generally much easier to use
 * APIs are usually well-documented
 * Utilizing APIs is often preferred by server administrators

Check a website's `robots.txt` file (by visiting `www.example.com/robots.txt`) to understand the server's guidelines and limitations regarding web scraping.
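
Python's standard library can read these rules for you. The sketch below feeds a made-up `robots.txt` (the rules and URLs are assumptions) to `urllib.robotparser` so that it runs without network access; for a real site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this illustration
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a generic crawler ("*") may fetch each URL
allowed_public = parser.can_fetch("*", "https://www.example.com/products")
allowed_private = parser.can_fetch("*", "https://www.example.com/private/data")

# Seconds the site asks crawlers to wait between requests (None if not specified)
delay = parser.crawl_delay("*")

print(allowed_public, allowed_private, delay)
```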

## <a id='toc4_3_'></a>[Summary](#toc0_)

1. **Web Technologies**:
   - **HTML**: This is the markup language that holds the content of the webpage. It is the primary target when we engage in web scraping.
   - **CSS**: Cascading Style Sheets are used to describe the look and formatting of a document written in HTML. 
   - **JavaScript**: This is a scripting language used to create and control dynamic website content.

2. **HTML Structure**:
   - **Hierarchical**: HTML documents are structured hierarchically, meaning elements are nested within other elements, forming a tree-like structure.
   - **Tags**: These are the building blocks of HTML, defining elements that hold different types of content.
   - **Attributes**: HTML tags can have attributes, which define properties of an element and are used to set various characteristics such as class, ID, and style.

3. **Web Scraping Tools**:
   - **Requests**: A Python library that allows you to send HTTP requests to get the HTML content of a webpage.
   - **Beautiful Soup**: A Python library that facilitates the programmatic analysis of HTML, helping in parsing the HTML and navigating the parse tree.
   - **Selenium**: When the webpage content is dynamic (generated by JavaScript), a tool like Selenium becomes necessary. Selenium drives a real browser, which executes the page's JavaScript, so the dynamically loaded content becomes accessible for scraping.
   
4. **Finding and Selecting Elements**:
   - **Selection by Tag, Class, and ID**: We can find elements using various attributes such as their tag name, class name, or ID.
   - **CSS Selectors**: These are patterns for more complex selections. They can combine tags, classes, and attributes, and use the relationships between elements (for example, descendants) to locate them.


## <a id='toc4_4_'></a>[Further materials](#toc0_)

[Web archive](http://web.archive.org/): browse snapshots of how web pages looked in the past.

Articles:
- [Deep versus Dark Web](https://www.britannica.com/story/whats-the-difference-between-the-deep-web-and-the-dark-web)

Videos:
- [How does the internet work?](https://www.youtube.com/watch?v=x3c1ih2NJEg) (9 min) - also seen in class 
- [How are packets transmitted?](https://www.youtube.com/watch?v=7_LPdttKXPc) (5 min)

### <a id='toc4_4_1_'></a>[How to Solve a 403 Error](#toc0_)

When you get a `403` status code in response to a web request, it means "Forbidden." The server understands your request, but it refuses to fulfill it. This is often a measure by websites to prevent web scraping or automated access.

Here's why you might get a `403 Forbidden` error:

1. **User-Agent**: Many websites block requests that don't have a standard web browser User-Agent. The default User-Agent of the `requests` library often gets blocked.
2. **Robots.txt**: This is a file websites use to tell web crawlers which pages or sections of the site shouldn't be processed or scanned. The file itself is advisory, but a site may actively block automated access to the paths it lists. Respect it.
3. **Rate Limiting**: Websites might block you if you make too many requests in a short period.

And more...

To solve it, try the following, starting from the user-agent:

1. **Change the User-Agent**:
   You can mimic a request from a web browser by setting a User-Agent header.
   ```python
   headers = {
       "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
   }
   response = requests.get(url, headers=headers)
   ```

2. **Use a Web Scraper Library**:
   Libraries like Scrapy or Selenium can help bypass restrictions, especially when JavaScript rendering is involved.

3. **Respect `robots.txt`**:
   Always check `https://www.example.com/robots.txt` (replace `example.com` with the website's domain) to see which URLs you're allowed to access.

4. **Rate Limiting**:
   Implement delays in your requests using `time.sleep(seconds)` to avoid hitting rate limits.

5. **Use Proxies or VPN**:
   Rotate IP addresses or use a VPN service if the server has blocked your IP.

6. **Sessions & Cookies**:
   Some websites might require maintaining sessions or handling cookies.
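
Several of these points (User-Agent, rate limiting, sessions and cookies) can be combined. The following is a minimal sketch, not a guaranteed fix: `make_session` and `fetch_all` are hypothetical helper names, and the one-second default delay is an assumption you should adapt to the site's own guidelines.

```python
import time

import requests


def make_session(user_agent):
    """Create a session that sends the same User-Agent and keeps cookies between requests."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_all(session, urls, delay_seconds=1.0):
    """Request each URL in turn, pausing between requests to stay under rate limits."""
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        responses.append(session.get(url, timeout=10))
    return responses


session = make_session(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)

# Example call (not executed here because it needs network access):
# responses = fetch_all(session, ["https://www.example.com/page1", "https://www.example.com/page2"])
```

Because the session stores any cookies the server sets, later requests in the same `fetch_all` call send them back automatically.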


### <a id='toc4_4_2_'></a>[How to show an image from a URL](#toc0_)

In [None]:
from io import BytesIO
from PIL import Image

response = requests.get('https://i.imgflip.com/7zldoz.jpg')
Image.open(BytesIO(response.content))