# Welcome to my own Lab on Web Scraping!
The goal is to understand what web scraping is, how to use it, and why it is so useful for data scientists. This lab is divided into several parts. Let's start and have fun!

## Useful Libraries for Web Scraping

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Section 1: How to load a web page with requests

The `requests` library lets you retrieve the web page you are interested in. More specifically, it allows you to send HTTP requests using Python (HTTP is a communication protocol that is beyond the scope of this class but still very interesting).
Each HTTP request returns a `Response` object with all the response data (content, encoding, status, etc.). There are several attributes and methods you can use to check whether the URL is correct, view the page source, and more.

Below are some examples:

In [6]:
url1 = "https://earlham.edu"
response = requests.get(url1)
print(response.status_code)  # Check if the request was successful
print(response.text)  # View the page source

403
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>



In [19]:
url2 = "https://example.com"
response = requests.get(url2)
print(response.status_code)  # Check if the request was successful
print(response.text)  # View the page source

200
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This d

### Exercise 1
Explore some of the attributes and methods available on the response object. Find the encoding of the web page and the time elapsed between sending the request and receiving the response. You can find more information here: https://www.geeksforgeeks.org/python-requests-tutorial

Solution:

In [21]:
print(f"Time: {response.elapsed}")
print(f"Encoding: {response.encoding}")

Time: 0:00:00.087845
Encoding: UTF-8
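
The two requests in this section returned different status codes (403 and 200), so it can help to check the status before using the page source. The sketch below is one possible way to do that, not the only one. The helper name `page_source_or_none` is made up for this lab, and the `Response` objects are built by hand (including setting the private `_content` attribute) only so that the example runs without a network connection.

```python
import requests


def page_source_or_none(response):
    """Return the page source if the request succeeded, otherwise None."""
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.HTTPError as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# Hand-built responses standing in for real ones (assumption: offline demo only).
forbidden = requests.Response()
forbidden.status_code = 403

success = requests.Response()
success.status_code = 200
success.encoding = "utf-8"
success._content = b"<html><body>Hello</body></html>"  # private attribute, demo only

forbidden_source = page_source_or_none(forbidden)
success_source = page_source_or_none(success)
print(forbidden_source)
print(success_source)
```

With a real request, you would pass the result of `requests.get(url2)` to the helper instead of a hand-built object.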


# Section 2: Explore HTML basics

It is crucial to understand the **HTML basics** when performing web scraping. Every web page is built on HTML, so knowing its structure is what lets us locate the data we want. A page is organized with tags such as `head`, `title`, and `body`, and we use these tags to tell the parser which part of the page to extract.

Below are some useful tags and their descriptions:

- **`<!DOCTYPE html>`**: Declares that the document is an HTML5 document.
- **`<html>`**: The root element of an HTML page.
- **`<head>`**: Contains meta information about the HTML page.
- **`<title>`**: Specifies a title for the HTML page (shown in the browser's title bar or the page's tab).
- **`<body>`**: Defines the document's body and contains all the visible content, such as headings, paragraphs, and images.
- **`<h1>`**: Defines a large heading.
- **`<p>`**: Defines a paragraph.
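
To connect these tags to an actual document, here is a minimal sketch that parses a small made-up page (the `page` string is an assumption for illustration) and lists the name of every tag BeautifulSoup finds, in document order. Passing `True` to `find_all` matches every tag.

```python
from bs4 import BeautifulSoup

# A tiny made-up page that uses the tags described in the list
page = """<!DOCTYPE html>
<html>
  <head><title>Tiny page</title></head>
  <body>
    <h1>A heading</h1>
    <p>A paragraph.</p>
  </body>
</html>"""

soup = BeautifulSoup(page, "html.parser")

# find_all(True) returns every tag in the document, in the order it appears
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)
```

The doctype declaration is not a tag, so it does not appear in the list.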

Here is an example of a simple web page. It uses the `href` attribute of the `<a>` tag to define a hyperlink, here pointing to the page we loaded in the first part of the lab.

In [24]:
html = """
<html>
    <head><title>Vassily</title></head>
    <body>
        <h1>Hello, World!</h1>
        <a href="https://example.com">Visit Example</a>
    </body>
</html>
"""

I can use BeautifulSoup to parse info from this page

In [23]:
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)  # Output: Vassily
print(soup.h1.text)     # Output: Hello, World!
print(soup.a['href'])   # Output: https://example.com

Vassily
Hello, World!
https://example.com


### Exercise 2
Make your own HTML page about something you like, add hyperlinks, and use BeautifulSoup to parse information from it.

Solution:

In [28]:
my_web_page = """
<html>
    <head><title>Golf</title></head>
    <body>
        <h1>What is golf?</h1>
        <p>The modern game of golf originated in 15th century Scotland. The 18-hole round was created at the Old Course at St Andrews in 1764. Golf's first major, and the world's oldest golf tournament, is The Open Championship, also known as The Open, which was first played in 1860 at the Prestwick Golf Club in Ayrshire, Scotland. This is one of the four major championships in men's professional golf, the other three being played in the United States: The Masters, the U.S. Open, and the PGA Championship. Below is more information about golf!</p>
        <a href="https://en.wikipedia.org/wiki/Golf">More info on golf</a>
    </body>
</html>
"""

In [29]:
soup = BeautifulSoup(my_web_page, 'html.parser')
print(soup.title.text)  # Output: Golf
print(soup.h1.text)     # Output: What is golf?
print(soup.a['href'])   # Output: https://en.wikipedia.org/wiki/Golf

Golf
What is golf?
https://en.wikipedia.org/wiki/Golf


Now that you know how HTML pages work, it is time to use these skills on a real web page and start collecting data!

# Section 3: Collect data from a real web page

After some practice, we are now able to understand the structure of an HTML page. With that in mind, we can collect data from many web pages without reading them by hand: their HTML structure is enough to automate the collection. Let's say we want to collect the news titles of an online magazine. We can automate the process with the skills we developed so far.

For instance, to get the news titles from CNN, we can do the following:

In [32]:
url = "https://cnn.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all links on the page; the link text often holds an article title.
# href=True skips <a> tags without an href attribute (title['href'] would raise a KeyError on them).
titles = soup.find_all('a', href=True)
for i, title in enumerate(titles, start=1):
    # get_text(strip=True) removes the surrounding whitespace and newlines
    print(f"{i}: {title.get_text(strip=True)} ({title['href']})")

1:  (https://www.cnn.com)
2: US (https://www.cnn.com/us)
3: World (https://www.cnn.com/world)
4: Politics (https://www.cnn.com/politics)
5: Business (https://www.cnn.com/business)
6: Health (https://www.cnn.com/health)
7: Entertainment (https://www.cnn.com/entertainment)
8: Style (https://www.cnn.com/style)
9: Travel (https://www.cnn.com/travel)
10: Sports (https://www.cnn.com/sports)
11: Underscored (https://www.cnn.com/cnn-underscored)
...
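
Not every `<a>` tag carries an `href` attribute. Subscripting a tag with a missing attribute (`tag['href']`) raises a `KeyError`, while `tag.get('href')` returns `None`. The sketch below illustrates the safer lookup on a small made-up snippet. The `snippet` string and the `collect_links` helper are assumptions for illustration and are not taken from the CNN page.

```python
from bs4 import BeautifulSoup

# Made-up navigation block: the second <a> tag has no href attribute
snippet = """
<nav>
    <a href="https://example.com/world">  World  </a>
    <a name="top">Anchor without a link</a>
    <a href="https://example.com/sports">Sports</a>
</nav>
"""


def collect_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a"):
        href = tag.get("href")  # None when the attribute is missing
        if href is None:
            continue
        links.append((tag.get_text(strip=True), href))
    return links


links = collect_links(snippet)
for text, href in links:
    print(f"{text}: {href}")
```

Skipping tags without an `href` keeps the loop from stopping halfway through a page.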

### Exercise 3
It is also possible to collect other information from the content of the page, such as tables, forms, lists, and images. In the following exercise, get information on the tables and images of the same website.

Solution:

In [37]:
# Count the images on the page and print the alternative text of each one.
# Note: image.alt looks for a child tag named <alt> and always returns None for an <img> tag;
# attributes are read with image.get('alt'), which returns None only when the attribute is missing.
images = soup.find_all('img')
print(f"Number of images: {len(images)}")
for i, image in enumerate(images, start=1):
    print(f"{i}: {image.get('alt')}")

Number of images: 55
...


In [43]:
# Meta tags store their information in attributes such as name and content.
# Note: meta.keywords looks for a child tag named <keywords> and always returns None,
# so we read the attributes with .get() instead.
metas = soup.find_all('meta')
print(f"Number of meta tags: {len(metas)}")
for i, meta in enumerate(metas, start=1):
    print(f"{i}: {meta.get('name')} = {meta.get('content')}")

Number of meta tags: 23
...
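
The exercise also mentions tables. As a minimal sketch, assuming a page that contains a simple `<table>` with a header row (the `table_html` string and its values are made up for illustration), the rows can be collected with BeautifulSoup and loaded into a pandas DataFrame:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table used only for this example
table_html = """
<table>
    <tr><th>Hole</th><th>Par</th></tr>
    <tr><td>1</td><td>4</td></tr>
    <tr><td>2</td><td>3</td></tr>
</table>
"""

soup = BeautifulSoup(table_html, "html.parser")
table = soup.find("table")

# The first row holds the column names, the remaining rows hold the data
rows = table.find_all("tr")
columns = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
data = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in rows[1:]
]

df = pd.DataFrame(data, columns=columns)
print(df)
```

The cell values are read as strings, so numeric columns may need a conversion (for example with `astype(int)`) before doing any computation.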


# The End!
Thank you for participating and investing time in this lab, I hope you enjoyed it!
- By Vassily for DS 401