# Web Scraping Tutorial

This tutorial will teach you how Python to scrap and extract data from a web page. We will use two packages, `requests` to scrap the webpage and `BeautifulSoup` to extract the data.

Many good references on web scraping are available online. I would recommend the following resources:
1. Automate Boring Stuff with Python by Al Sweigart (2020) has a chapter on Web Scraping tutorial, which can be read [online](https://automatetheboringstuff.com/2e/chapter12/).
2. Web Scraping With Python by Ryan Mitchell (2018) is a bit old book but provides a comprehensive guide to the topic.

**Goal:** We will extract the cryptocurrency market price from Etherscan website: https://etherscan.io/tokens

Your first step should always be to familiarize yourself with the website you want to scrape. Take a look at the website and try to inspect the HTML elements on the webpage.

## Step 1: Scrap a web page

Now, we are ready to scrap a webpage we want to get the data from with the `requests` package. We will use the following functions:

* `requests.get('URL')` - make a request to the specified URL
* `r.status_code` - get the status code of the request
* `r.content` - get the binary content of the page

More functions in the `requests` package are available in [its documentation](https://requests.readthedocs.io/en/latest/).

In [5]:
# First, we will import the requests package
import requests
import time

In [9]:
# Request the webpage
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://www.linkedin.com/jobs/search/?keywords=software%20engineer'
response = requests.get(url, headers=headers)
if response.status_code == 200:
    print("Successfully fetched the webpage")
elif response.status_code == 429:
    print("Rate limit exceeded. Retrying after a delay...")
    time.sleep(10)  # Wait for 10 seconds before retrying
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        print("Successfully fetched the webpage after retrying")
    else:
        print(f"Failed to fetch the webpage after retrying. Status code: {response.status_code}")
    print("Successfully fetched the webpage")
else:
    print(f"Failed to fetch the webpage. Status code: {response.status_code}")

    # Check the content of the webpage
    print(response.content[:500])  # Print the first 500 characters of the content

Successfully fetched the webpage


In [10]:
print(response.content[:500])  

b'<!DOCTYPE html>\n\n    \n    \n    \n    \n    \n    \n\n    \n    \n    \n    \n\n    \n    \n    \n    \n\n    \n    <html lang="en">\n      <head>\n        <meta name="pageKey" content="d_jobs_guest_search">\n<!---->          <meta name="linkedin:pageTag" content="urlType=jserp_custom;emptyResult=false">\n        <meta name="locale" content="en_US">\n        <meta id="config" data-app-version="2.0.2056" data-call-tree-id="AAYjj9N2DtnexTcI4B9PjQ==" data-multiproduct-name="jobs-guest-frontend" data-service-name="jobs-g'


In [20]:
import random
import time
from bs4 import BeautifulSoup

# 随机选择一个代理
def get_random_proxy():
    # Exclude the problematic proxy
    valid_proxies = [proxy for proxy in proxies if proxy['http'] != 'http://98.76.54.32:8080']
    return random.choice(valid_proxies)

# 构建 Indeed 搜索 URL
def build_indeed_url(keyword, location, page):
    base_url = "https://www.indeed.com/jobs"
    params = {
        "q": keyword,     # 搜索关键词
        "l": location,    # 搜索位置
        "start": page * 10  # Indeed 每页有 10 个结果
    }
    return base_url, params

# 爬取 Indeed 工作信息
def scrape_indeed_jobs(keyword, location, num_pages=1):
    for page in range(num_pages):
        url, params = build_indeed_url(keyword, location, page)
        proxy = get_random_proxy()  # 使用随机代理
        retries = 3  # 设置重试次数
        for attempt in range(retries):
            try:
                response = requests.get(url, headers=headers, params=params, proxies=proxy, timeout=(5, 10))  # 设置超时时间
                if response.status_code == 200:
                    break  # 成功获取响应，跳出重试循环
            except requests.exceptions.RequestException as e:
                print(f"Error occurred: {e}")
                if attempt < retries - 1:
                    print(f"Retrying... ({attempt + 1}/{retries})")
                    time.sleep(2)  # 等待 2 秒后重试
                else:
                    print("Max retries exceeded. Skipping this page.")
                    continue

        # 检查状态码
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            job_cards = soup.find_all('div', class_='job_seen_beacon')  # 根据 CSS 类选择职位卡片

            for job_card in job_cards:
                # 提取职位标题
                job_title_tag = job_card.find('h2', class_='jobTitle')
                job_title = job_title_tag.text if job_title_tag else 'N/A'

                # 提取公司名称
                company_tag = job_card.find('span', class_='companyName')
                company_name = company_tag.text if company_tag else 'N/A'

                # 提取工作地点
                location_tag = job_card.find('div', class_='companyLocation')
                location = location_tag.text if location_tag else 'N/A'

                # 提取发布时间
                post_time_tag = job_card.find('span', class_='date')
                post_time = post_time_tag.text if post_time_tag else 'N/A'

                print(f"职位: {job_title}")
                print(f"公司: {company_name}")
                print(f"地点: {location}")
                print(f"发布时间: {post_time}")
                print("-" * 50)
        else:
            print(f"Failed to retrieve page {page + 1}, status code: {response.status_code}")

        # 在每次请求之间添加随机延迟，模拟真人操作
        delay = random.uniform(2, 6)  # 随机延迟 2 到 6 秒
        print(f"等待 {delay:.2f} 秒...")
        time.sleep(delay)

# 使用示例，爬取 "Python Developer" 在 "London" 的职位信息
scrape_indeed_jobs("Python Developer", "London", num_pages=2)


Error occurred: HTTPSConnectionPool(host='www.indeed.com', port=443): Max retries exceeded with url: /jobs?q=Python+Developer&l=London&start=0 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x77150c047310>, 'Connection to 123.45.67.89 timed out. (connect timeout=5)'))
Retrying... (1/3)
Error occurred: HTTPSConnectionPool(host='www.indeed.com', port=443): Max retries exceeded with url: /jobs?q=Python+Developer&l=London&start=0 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x771501161960>, 'Connection to 123.45.67.89 timed out. (connect timeout=5)'))
Retrying... (2/3)
Error occurred: HTTPSConnectionPool(host='www.indeed.com', port=443): Max retries exceeded with url: /jobs?q=Python+Developer&l=London&start=0 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x771501161540>, 'Connection to 123.45.67.89 timed out. (connect timeout=5)'))
Max retries exceeded. Skipping this page.


UnboundLocalError: local variable 'response' referenced before assignment

In [22]:
# Get the header of the web page
#Beautiful Soup project

import bs4 as bs
from urllib.request import urlopen, Request
import pandas as pd
import re

#Create a header to prevent 404 error. This is necessary on sites like Angelist.
headers={'User-Agent': 'Mozilla/5.0'}

#Request an Indeed.com Webpage
source = ('https://fr.indeed.com/?advn=5964528717357619&vjk=376d6a346093a302')
req = Request(url = source, headers = headers)
html = urlopen(req).read()

#Create a parse tree with beautifulsoup
soup = bs.BeautifulSoup(html, 'lxml')


#Get all Job Titles
tagarray = []

#Search through bs parse tree to find text with the below properties (which are all the job titles)
for tag in soup.findAll('a', {'target': "_blank", 'title': True, 'data-tn-element':"jobTitle"}):
    tagarray.append(tag.get_text()) #output text-based results to array

jobtitles = pd.DataFrame(data = tagarray) #output to pandas df


#Get all Job Links
urlarray = []

#Search through bs parse tree to find links in href tag with 'clk' in them
for url in soup.findAll('a', {'href': re.compile('clk')}, {'href': re.compile('company')}):
    urlarray.append("www.indeed.com" + url.get('href')) #add 'www.indeed.com' to the href and append to array

joblinks = pd.DataFrame(data = urlarray) #ouput to pandas df

    
#Join DataFrames and rename columns
result = pd.concat([jobtitles, joblinks], axis = 1)
result.columns = ['Job Titles', 'Job Links'] 

#Print and output to a CSV
print(result)
result.to_csv('beautifulsoup.csv')

HTTPError: HTTP Error 403: Forbidden

In [None]:
# Get the content of the web page


In [None]:
# Get the text in the web page


In [None]:
# Save the content of web page


## Step 2: Load the web page as BeautifulSoup object

After we crawled the web page and download it to the local disk, we will use `BeautifulSoup` package to parse HTML file and access the content. We will use the following functions:

**1. Load the web page to BeautifulSoup**
* `soup = BeautifulSoup(html_doc, 'html.parser')` - parse the HTML content to BeautifulSoup object

In [None]:
# First, we will import the BeautifulSoup from bs4 package
from bs4 import BeautifulSoup

In [None]:
# Load the web page and parse it to BeautifulSoup


In [None]:
# Check the type of our soup object


In [None]:
# Print the content of the web page


In [None]:
# Print all the text in the webpage


**2. Get the content of the element**
* `soup.title` - get the title of the page
* `soup.title.string` - get the string in the title element
* `soup.h1` - get the H1 element in the web page
* `soup.h1.attrs` - get all attributes in the H1 element
* `soup.h1['class']` - get the class attribute in the H1 element

In [None]:
# Get the title of the page


In [None]:
# Other HTML elements also work too


In [None]:
# Get the class attribute of an element


**3. Look for the element in the web page**
* `soup.find('HTML_tag')` - get the element from an HTML tag
* `soup.find_all('HTML_tag')` - get the list of elelemts that has the specified HTML tag
* `soup.select('CSS_selector')` - get the list of elements with the specified [CSS selector](https://www.w3schools.com/cssref/css_selectors.asp)

In [None]:
# We can also get the page title using soup.find() function


In [None]:
# Get all the elements with image tag


In [None]:
# Get all the token names on the web page


## Step 3: Extract the data from the table

Now, we will extract the cryptocurrencies market price from the table.

In [None]:
# Get the table element in the web page


In [None]:
# Get the table headers


For loop over each row in the table and extract the data for each column in the row.

In [None]:
# For loop over each row in the table


    # Get all the columns in the row


    # For loop over each column and extract the string



## Step 4: Create a DataFrame table and write to a CSV file

In [None]:
import pandas as pd

In [None]:
# How many rows in the extracted data


In [None]:
# Convert the data list to DataFrame object


Split the columns with "\n"

In [None]:
# Split between token name and token symbol


In [None]:
# Split between the USD and ETH prices


In [None]:
# Split the number of holders and percent changes


Convert string into numerical columns

In [None]:
# Regular expression pattern to match numbers
pattern = r'([-+]?\d[\d,]*(?:\.\d+)?)'

In [None]:
# For each numerical column, convert the string to float numbers


    # Use df[col_name].str.extract() to extract the numbers and
    # .astype(float) to convert the string to float numbers




Last but not least, remove the bracket in token symbol column

Write the DataFrame table to CSV