# Introduction to Scraping

## What is scraping?

🤓 Scraping refers to the process of extracting data from websites or other sources using automated scripts. This is often done to gather information that is not readily available through APIs or other structured data formats.

Python is a popular language for web scraping due to its simplicity and the availability of powerful libraries designed for this purpose.

## HTML

HTML (HyperText Markup Language) is the standard markup language used to create and design web pages. It provides the structure of a web page, defining elements such as headings, paragraphs, links, images, and other content.

### Basic Structure of an HTML Document

An HTML document consists of a series of elements, which are defined by tags. Here is a simple example of an HTML document:

```html
<!DOCTYPE html>
<html>
<head>
    <title>My First HTML Page</title>
</head>
<body>
    <h1>Welcome to My Website</h1>
    <p>This is a paragraph of text.</p>
    <a href="https://perdu.com">Visit perdu.com</a>
    <img src="https://fr.m.wikipedia.org/wiki/Fichier:Python-logo-notext.svg" alt="A great image!">
</body>
</html>
```

❓ **>>>** Copy and paste this code into a file, then open it in your browser!
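
Since the goal is to read pages like this one from Python, it may help to see how a parser walks through the tags. The sketch below uses `HTMLParser` from the standard library on a copy of the document above; the names `TagLister` and `HTML_DOC` are illustrative and not part of the original material.

```python
from html.parser import HTMLParser

# Copy of the HTML example above, stored as a Python string.
HTML_DOC = """<!DOCTYPE html>
<html>
<head>
    <title>My First HTML Page</title>
</head>
<body>
    <h1>Welcome to My Website</h1>
    <p>This is a paragraph of text.</p>
    <a href="https://perdu.com">Visit perdu.com</a>
    <img src="https://fr.m.wikipedia.org/wiki/Fichier:Python-logo-notext.svg" alt="A great image!">
</body>
</html>"""


class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("href", "https://perdu.com")]
        self.tags.append((tag, dict(attrs)))


parser = TagLister()
parser.feed(HTML_DOC)

for tag, attrs in parser.tags:
    print(tag, attrs)
```

Each printed line shows a tag name followed by its attributes, which is the information a scraper relies on to locate content.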

### HTML Table

An HTML table is defined using the `<table>` tag. Inside the `<table>` tag, you can use various other tags to define the structure and content of the table:

- **`<tr>`**: Defines a table row.
- **`<th>`**: Defines a table header cell.
- **`<td>`**: Defines a table data cell.
- **`<thead>`**: Groups the header content in a table.
- **`<tbody>`**: Groups the body content in a table.
- **`<tfoot>`**: Groups the footer content in a table.

Example:

```html
<!DOCTYPE html>
<html>
<head>
    <title>HTML Table Example</title>
</head>
<body>
    <table border="1">
        <thead>
            <tr>
                <th>Header 1</th>
                <th>Header 2</th>
                <th>Header 3</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Row 1, Cell 1</td>
                <td>Row 1, Cell 2</td>
                <td>Row 1, Cell 3</td>
            </tr>
            <tr>
                <td>Row 2, Cell 1</td>
                <td>Row 2, Cell 2</td>
                <td>Row 2, Cell 3</td>
            </tr>
        </tbody>
        <tfoot>
            <tr>
                <td>Footer 1</td>
                <td>Footer 2</td>
                <td>Footer 3</td>
            </tr>
        </tfoot>
    </table>
</body>
</html>
```

❓ **>>>** Copy and paste this code into a file, then open it in your browser!
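
Tables are a frequent scraping target because each `<tr>` maps naturally to a Python list. Here is a minimal sketch, assuming a shortened version of the table above; `TableParser` and `TABLE_HTML` are hypothetical names, and the standard library's `HTMLParser` is only one of several ways to do this.

```python
from html.parser import HTMLParser

# Shortened version of the table example above (two columns, no footer).
TABLE_HTML = """
<table>
    <thead>
        <tr><th>Header 1</th><th>Header 2</th></tr>
    </thead>
    <tbody>
        <tr><td>Row 1, Cell 1</td><td>Row 1, Cell 2</td></tr>
        <tr><td>Row 2, Cell 1</td><td>Row 2, Cell 2</td></tr>
    </tbody>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read
        self._cell = None  # text chunks of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableParser()
parser.feed(TABLE_HTML)

header, *body = parser.rows
print(header)
print(body)
```

The first row holds the header cells and the remaining rows hold the data, so the result can be handed to other tools as a list of lists.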

### HTML Lists

#### `<ul>` and `<ol>` Tags

The `<ul>` (unordered list) and `<ol>` (ordered list) tags are used to create lists. By default, each item of a `<ul>` is marked with a bullet point, and each item of an `<ol>` is numbered.

#### `<li>` Tag

The `<li>` tag stands for "list item." It is used to define individual items within a list. Each `<li>` element must be contained within a `<ul>` (unordered list) or `<ol>` (ordered list) element.


```html
<!DOCTYPE html>
<html>
<head>
    <title>Unordered List Example</title>
</head>
<body>
    <h2>My Favorite Fruits</h2>
    <ul>
        <li>Apple</li>
        <li>Banana</li>
        <li>Cherry</li>
        <li>Date</li>
    </ul>
</body>
</html>
```

### Nested Lists

You can also create nested lists by placing one list inside another. This is useful for creating hierarchical structures.

```html
<!DOCTYPE html>
<html>
<head>
    <title>Nested Unordered List</title>
</head>
<body>
    <h2>My Favorite Fruits</h2>
    <ul>
        <li>Apple</li>
        <li>Banana</li>
        <li>Cherry
            <ul>
                <li>Red Cherry</li>
                <li>Black Cherry</li>
            </ul>
        </li>
        <li>Date</li>
    </ul>
</body>
</html>
```
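
When scraping a nested list, the nesting level is often worth keeping. The sketch below, under the assumption that counting open `<ul>`/`<ol>` tags is enough to know the depth, records each item with its level; `ListParser` and `LIST_HTML` are illustrative names.

```python
from html.parser import HTMLParser

# Shortened version of the nested list above.
LIST_HTML = """
<ul>
    <li>Apple</li>
    <li>Cherry
        <ul>
            <li>Red Cherry</li>
            <li>Black Cherry</li>
        </ul>
    </li>
    <li>Date</li>
</ul>
"""


class ListParser(HTMLParser):
    """Collect (depth, text) pairs for every list item."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <ul>/<ol> are currently open
        self.items = []  # (depth, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace between tags and any text outside a list.
        if text and self.depth > 0:
            self.items.append((self.depth, text))


parser = ListParser()
parser.feed(LIST_HTML)

for depth, text in parser.items:
    print("  " * (depth - 1) + "- " + text)
```

The indentation of the printed output mirrors the hierarchy of the HTML.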

### The `<style>` tag

The `<style>` tag in HTML is used to define internal CSS styles for a web page. These styles are applied to the elements within the same HTML document. The `<style>` tag is typically placed within the `<head>` section of the HTML document (although it can also be placed within the `<body>` section if needed).

### Basic Syntax

The `<style>` tag contains CSS rules that define how HTML elements should be displayed. A rule targets elements through a selector. Let's look at two common ones: the **element selector** and the **class selector**.

#### Different types of Selectors

- **Element Selector**: Applies styles to all elements of a specific type.
  ```css
  h1 {
      color: blue;
  }
  ```

- **Class Selector**: Applies styles to all elements with a specific `class` attribute. In CSS, the class name is prefixed with a `.`.
  ```css
  .container {
      border: 1px solid #ccc;
      padding: 10px;
      margin-bottom: 10px;
      border-radius: 5px;
      background-color: #fff;
  }
  ```

#### Example
```html
<!DOCTYPE html>
<html>
<head>
    <title>Example Page</title>
    <style>
        /* element selector */
        body {
            font-family: Arial, sans-serif;
            margin: 20px;
            line-height: 1.6;
        }
        /* class selector */
        .container {
            border: 1px solid #ccc;
            padding: 10px;
            margin-bottom: 10px;
            border-radius: 5px;
            background-color: #fff;
        }
        /* element selector */
        h1 {
            color: blue;
        }
    </style>
</head>
<body>
    <div class="container">
        <h1>Welcome to My Website</h1>
        <p>This is a paragraph of text.</p>
    </div>
</body>
</html>
```


### The `<div>` tag

In HTML, the `<div>` tag stands for "division" and is used to define a section or a container for other HTML elements. It is a block-level element, meaning it starts on a new line and takes up the full width available. The `<div>` tag is commonly used to group together HTML elements for styling purposes, using CSS, or for scripting purposes, using JavaScript.

#### Attributes of `<div>`

The `<div>` tag can have various attributes to provide additional information or functionality.

- **`id`**: Specifies a unique identifier for the `<div>` element.
- **`class`**: Specifies one or more class names for the `<div>` element, allowing you to apply styles from CSS (multiple class names are separated by whitespace).
- **`style`**: Specifies inline CSS styles for the `<div>` element.
- **`title`**: Provides additional information about the `<div>` element, which is displayed as a tooltip when the mouse hovers over the element.

#### Example with Several `<div>` Elements and Attributes

```html
<!DOCTYPE html>
<html>
<head>
    <title>HTML Div Example with Attributes</title>
    <style>
        .container {
            border: 1px solid black;
            padding: 10px;
            margin: 10px;
        }
        .header {
            background-color: lightblue;
            text-align: center;
        }
        .content {
            background-color: lightgray;
        }
        .footer {
            background-color: lightgreen;
            text-align: center;
        }
        .special {
            color: red;
            font-weight: bold;
        }
    </style>
</head>
<body>
    <div id="header" class="container header" title="This is the header">
        <h1>Welcome to My Website</h1>
    </div>
    <div id="content" class="container content" style="font-size: 16px;">
        <p>This is the main content of the page.</p>
        <p class="special">This is a special paragraph.</p>
        <div id="nested" class="container" style="background-color: lightyellow;">
            <p>This is a nested div inside the content div.</p>
        </div>
    </div>
    <div id="footer" class="container footer">
        <p>This is the footer of the page.</p>
    </div>
</body>
</html>
```
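
The `id` and `class` attributes are what scrapers most often use to find the right block of a page. As a rough sketch, assuming a stripped-down copy of the example above, the code below collects the `id` and the class names of each `<div>`; `DivCollector` and `DIV_HTML` are hypothetical names.

```python
from html.parser import HTMLParser

# Stripped-down copy of the <div> example above (content removed).
DIV_HTML = """
<div id="header" class="container header" title="This is the header"></div>
<div id="content" class="container content" style="font-size: 16px;">
    <div id="nested" class="container"></div>
</div>
<div id="footer" class="container footer"></div>
"""


class DivCollector(HTMLParser):
    """Record the id and the class names of every <div>."""

    def __init__(self):
        super().__init__()
        self.divs = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        self.divs.append({
            "id": attrs.get("id"),
            # The class attribute is a whitespace-separated list of names.
            "classes": (attrs.get("class") or "").split(),
        })


collector = DivCollector()
collector.feed(DIV_HTML)

# Keep only the ids of <div> elements carrying the "footer" class.
footer_ids = [d["id"] for d in collector.divs if "footer" in d["classes"]]
print(footer_ids)  # ['footer']
```

Splitting the `class` value on whitespace matters: testing `"footer" in "container footer"` on the raw string would also match unrelated names such as `footer-link`.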

❓ **>>>** Code an HTML page that looks like this:

![web site image](files/web_site.png)

**hint**: The ".container" style (class selector) has the following properties:

```
            font-family: Arial, sans-serif;
            margin: 10px;
            line-height: 1.1;
            border: 1px solid #ccc;
            padding: 10px;
            margin-bottom: 10px;
            border-radius: 5px;
            background-color: #fff;
```

And the `<th>` and `<td>` tags (element selectors) from the table have this style:
    
``` 
            border: 1px solid #ccc;
            padding: 8px;
            text-align: left;
```

## Scraping with `requests`

**Requests** is a library built on top of **urllib3** that lets us send HTTP requests very concisely. Let's use it to get the source code of the [Books To Scrape](https://books.toscrape.com/) home page:

In [None]:
import requests

books_url = 'https://books.toscrape.com/'

r = requests.get(books_url)

page = r.text

# Display the first 5000 characters
#page[:5000]
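
Once the page is a plain string, built-in string methods are enough to pull text out of simple markup. The sketch below works on a small hypothetical snippet (the variable `sample` is made up and is not the real page) and shows one possible approach based on `split` and `rsplit`.

```python
# Hypothetical snippet standing in for a downloaded page (not the real site).
sample = """
<ul class="nav">
    <li><a href="/a.html">  First  </a></li>
    <li><a href="/b.html">Second</a></li>
</ul>
"""

labels = []
# Every link ends with "</a>", so each chunk before it holds exactly one label.
for chunk in sample.split("</a>")[:-1]:
    # The label is whatever follows the last ">" of the opening <a ...> tag.
    label = chunk.rsplit(">", 1)[1]
    labels.append(label.strip())

print(labels)  # ['First', 'Second']
```

This kind of slicing is fragile when the markup changes, which is one reason dedicated parsers exist, but it is a good way to get familiar with the raw source.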

## Exercise

❓ **>>>** Use Python built-in functions to extract the left menu and turn it into a list.

✅ Expected output:

['Books',
 'Travel',
 'Mystery',
 'Historical Fiction',
 'Sequential Art',
 'Classics',
 'Philosophy',
 'Romance',
 'Womens Fiction',
 'Fiction',
 'Childrens',
 'Religion',
 'Nonfiction',
 'Music',
 'Default',
 'Science Fiction',
 'Sports and Games',
 'Add a comment',
 'Fantasy',
 'New Adult',
 'Young Adult',
 'Science',
 'Poetry',
 'Paranormal',
 'Art',
 'Psychology',
 'Autobiography',
 'Parenting',
 'Adult Fiction',
 'Humor',
 'Horror',
 'History',
 'Food and Drink',
 'Christian Fiction',
 'Business',
 'Biography',
 'Thriller',
 'Contemporary',
 'Spirituality',
 'Academic',
 'Self Help',
 'Historical',
 'Christian',
 'Suspense',
 'Short Stories',
 'Novels',
 'Health',
 'Politics',
 'Cultural',
 'Erotica',
 'Crime']

In [None]:
# Code here!


## HTTP Request

An HTTP (HyperText Transfer Protocol) request is a message sent by a client (such as a web browser, a mobile app or... Python!) to a server to perform an action, such as retrieving a web page, submitting a form, or interacting with an API. HTTP is the foundation of data communication on the World Wide Web.

### Components of an HTTP Request

1. **Request Line**:
   - **Method**: Specifies the action to be performed (e.g., GET, POST, PUT, DELETE).
   - **URL**: The Uniform Resource Locator (URL) of the resource being requested; in the request line this is usually just the path, as in the example below.
   - **HTTP Version**: The version of the HTTP protocol being used (e.g., HTTP/1.1, HTTP/2).

   Example: `GET /index.html HTTP/1.1`

2. **Headers**:
   - **Host**: The domain name of the server (e.g., `www.example.com`).
   - **User-Agent**: Information about the client software (e.g., browser type and version).
   - **Accept**: The types of data that the client can process (e.g., `text/html`, `application/json`).
   - **Authorization**: Authentication information (e.g., API tokens).
   - **Content-Type**: The media type of the resource being sent (e.g., `application/json`).
   - **Content-Length**: The length of the request body in bytes.

   Example:
   ```
   Host: www.example.com
   User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3
   Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
   Authorization: Bearer YOUR_ACCESS_TOKEN
   Content-Type: application/json
   Content-Length: 34
   ```

3. **Body**:
   - The body of the request contains the data being sent to the server. It is typically used with methods like POST and PUT.
   - The body can contain form data, JSON data, XML data, or other types of data.

   Example (JSON body):
   ```json
   {
       "name": "John Doe",
       "email": "john.doe@example.com"
   }
   ```
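
To see how the three components fit together, the sketch below assembles a raw HTTP/1.1 message by hand. It is only an illustration: `build_request` is a hypothetical helper, `/users` is a made-up endpoint, and in practice a library such as Requests builds this message for you.

```python
import json


def build_request(method, path, host, headers=None, body=None):
    """Assemble an HTTP/1.1 request message as bytes (illustrative helper)."""
    payload = b"" if body is None else json.dumps(body).encode("utf-8")

    all_headers = {"Host": host}
    all_headers.update(headers or {})
    if payload:
        # Describe the body so the server knows how to read it.
        all_headers["Content-Type"] = "application/json"
        all_headers["Content-Length"] = str(len(payload))

    # Request line first, then one "Name: value" line per header.
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in all_headers.items()]

    # A blank line separates the headers from the body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + payload


raw = build_request(
    "POST",
    "/users",  # hypothetical endpoint
    "www.example.com",
    body={"name": "John Doe", "email": "john.doe@example.com"},
)
print(raw.decode("utf-8"))
```

The printed message starts with the request line, continues with the headers, and ends with the JSON body after an empty line.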

### The `headers` parameter of the Requests library

In the `requests` library for Python, the `headers` parameter is used to specify HTTP headers to be sent with the request. HTTP headers are key-value pairs that provide additional information about the request or the response, such as the type of content being sent, authentication details, or caching instructions.

### Common Use Cases for Headers

1. **Content-Type**: Specifies the media type of the resource.
   ```python
   headers = {
       'Content-Type': 'application/json'
   }
   ```

2. **Authorization**: Used for sending authentication tokens.
   ```python
   headers = {
       'Authorization': 'Bearer YOUR_ACCESS_TOKEN'
   }
   ```

3. **User-Agent**: Identifies the client software originating the request.
   ```python
   headers = {
       'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
   }
   ```
   
## IMDB

IMDB (Internet Movie Database) is a very famous website which is fun to scrape. Let's try to scrape all the movies on this [page](https://www.imdb.com/chart/top/).

In [None]:
imdb_url = 'https://www.imdb.com/chart/top/'
import requests
r = requests.get(imdb_url)
r.text

This doesn't seem to work... Why? Because we didn't specify a user agent! Let's try again with a basic one:

In [None]:
imdb_url = 'https://www.imdb.com/chart/top/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
r = requests.get(imdb_url, headers=headers)
r.text[:500]

**Tip:** You can use [this library](https://pypi.org/project/fake-useragent/) to generate fake user agents.

## Exercise

❓ **>>>** Scrape this page and create a list of tuples containing the rank of the movie (from 1 to 250), the movie name and the rating value.

✅ Expected output (first 10 movies):

```python
[(1, 'The Shawshank Redemption', 9.3),
 (2, 'The Godfather', 9.2),
 (3, 'The Dark Knight', 9),
 (4, 'The Godfather Part II', 9),
 (5, '12 Angry Men', 9),
 (6, 'The Lord of the Rings: The Return of the King', 9),
 (7, 'Schindler&apos;s List', 9),
 (8, 'Pulp Fiction', 8.9),
 (9, 'The Lord of the Rings: The Fellowship of the Ring', 8.9),
 (10, 'Il buono, il brutto, il cattivo', 8.8),]
```
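
Text taken straight from the page source can still contain HTML entities, such as the `&apos;` visible in the expected output above. As an optional cleanup step, the sketch below decodes them with `html.unescape` from the standard library; the variables `raw_title` and `raw_rating` are hypothetical stand-ins for values you would have extracted yourself.

```python
import html

raw_title = "Schindler&apos;s List"  # title as shown in the expected output above
raw_rating = "9.0"                   # hypothetical raw rating string

title = html.unescape(raw_title)     # decode HTML entities into characters
rating = float(raw_rating)           # store the rating as a number, not text

print((7, title, rating))
```

Whether to decode entities depends on what you plan to do with the data; for display or analysis, the decoded title is usually more convenient.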

In [None]:
# Code here!


## Scraping in France and Europe

Let's take a look at some parts of this [slideshow.](https://inseefrlab.github.io/formation-webscraping/#/title-slide)