# Lab | Web Scraping Multiple Pages

### Instructions Part 2
#### Practice web scraping. This part is not related to this week's GNOD project.

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field. Open a new Jupyter notebook and scrape at least 3 of these sites.

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'
- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
- List all language names and the number of related articles in the order they appear on wikipedia.org: url = 'https://www.wikipedia.org/'
- Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
- Display the top 10 languages by number of native speakers, stored in a pandas DataFrame: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

# Part 2: Practice web scraping

## 1. Retrieve the Wikipedia page for "Python" and create a list of links

### Import Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Requests

In [2]:
# URL of the Wikipedia page for "Python"
url = 'https://en.wikipedia.org/wiki/Python'

This code performs the following steps:
1. It sends an HTTP request to the Wikipedia page for "Python".
2. If the request is successful, it parses the HTML content of the page.
3. Then, it finds all the `<a>` (anchor) tags with the `href` attribute (these are the links).
4. It filters and formats these links to create a full URL and adds them to a list.
5. Finally, it prints out the list of URLs.

The code keeps only links whose `href` starts with `/wiki/` and contains no colon (`:`). The colon check filters out administrative links in namespaces such as `Special:` or `Help:` (for example, links to Wikipedia's editing and help pages), which is a simple way to keep mostly content links.
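
The filtering rule described above can be tried in isolation, without any network request. The following is a minimal sketch rather than the notebook's own code: `is_article_link`, `to_article_urls`, `BASE_URL`, and the sample `href` values are names and data introduced here for illustration, and `urljoin` from the standard library stands in for the f-string used in the notebook.

```python
from urllib.parse import urljoin

# Assumed base for resolving relative links (same host as the scraped page)
BASE_URL = 'https://en.wikipedia.org'


def is_article_link(href):
    """Return True for relative /wiki/ links that contain no colon."""
    return href.startswith('/wiki/') and ':' not in href


def to_article_urls(hrefs):
    """Keep article-style links only and turn them into absolute URLs."""
    return [urljoin(BASE_URL, href) for href in hrefs if is_article_link(href)]


# Hypothetical href values of the kinds found on a Wikipedia page
sample_hrefs = [
    '/wiki/Python_(genus)',
    '/wiki/Special:Search',            # namespace link, dropped by the colon check
    '/wiki/Help:Contents',             # namespace link, dropped by the colon check
    '#mw-head',                        # in-page anchor, dropped by the prefix check
    'https://example.org/wiki/Other',  # absolute external link, dropped by the prefix check
    '/wiki/Python_(mythology)',
]

for article_url in to_article_urls(sample_hrefs):
    print(article_url)
```

The colon test is a heuristic: it would also discard a genuine article whose title happens to contain a colon.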

In [3]:
# Sending a request to the URL
response = requests.get(url)

In [4]:
# Checking if the request was successful
if response.status_code == 200:
    # Parsing the content of the page
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Finding all the 'a' tags with 'href' attribute
    links = soup.find_all('a', href=True)

    # Creating a list to hold the URLs
    urls = []

    # Extracting URLs from the 'a' tags
    for link in links:
        href = link['href']
        if href.startswith('/wiki/') and ':' not in href:  # Filtering out administrative links
            full_url = f'https://en.wikipedia.org{href}'
            urls.append(full_url)

    # Printing the list of URLs
    # (a separate loop variable avoids overwriting `url`, which holds the request URL)
    for page_url in urls:
        print(page_url)
else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')

https://en.wikipedia.org/wiki/Main_Page
https://en.wikipedia.org/wiki/Main_Page
https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki/Pythonidae
https://en.wikipedia.org/wiki/Python_(genus)
https://en.wikipedia.org/wiki/Python_(mythology)
https://en.wikipedia.org/wiki/Python_(programming_language)
https://en.wikipedia.org/wiki/CMU_Common_Lisp
https://en.wikipedia.org/wiki/PERQ#PERQ_3
https://en.wikipedia.org/wiki/Python_of_Aenus
https://en.wikipedia.org/wiki/Python_(painter)
https://en.wikipedia.org/wiki/Python_of_Byzantium
https://en.wikipedia.org/wiki/Python_of_Catana
https://en.wikipedia.org/wiki/Python_Anghelo
https://en.wikipedia.org/wiki/Python_(Efteling)
https://en.wikipedia.org/wiki/Python_(Busch_Gardens_Tampa_Bay)
https://en.wikipedia.org/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
https://en.wikipedia.org/wiki/Python_(automobile_maker)
https://en.wikipedia.org/wiki/Python_(Ford_prototype)
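
The output above lists some pages more than once, because every `<a>` tag with a matching `href` is appended to the list. If unique links are wanted, one possible approach, assuming the order of first appearance should be preserved, is `dict.fromkeys`. The `urls` list below is a shortened copy of the output, and `unique_urls` is a name introduced here.

```python
# Shortened copy of the scraped list, including its repeated entries
urls = [
    'https://en.wikipedia.org/wiki/Main_Page',
    'https://en.wikipedia.org/wiki/Main_Page',
    'https://en.wikipedia.org/wiki/Python',
    'https://en.wikipedia.org/wiki/Python',
    'https://en.wikipedia.org/wiki/Python',
    'https://en.wikipedia.org/wiki/Pythonidae',
]

# Dictionary keys are unique and keep insertion order,
# so this drops repeats while preserving first-seen order
unique_urls = list(dict.fromkeys(urls))

for page_url in unique_urls:
    print(page_url)
```

A `set` would also remove the repeats, but it would not keep the links in page order.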


## 2. Number of Titles Changed in the United States Code

In [5]:
# Fetch the webpage content
url = 'http://uscode.house.gov/download/download.shtml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

In [6]:
# Find the elements marking changed titles (shown in bold on the page);
# they are <div> elements with the class 'usctitlechanged'
changed_titles = soup.find_all('div', class_='usctitlechanged')

# Print the number of changed titles
print(f'Number of Titles Changed: {len(changed_titles)}')

Number of Titles Changed: 1


In [7]:
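titles = []
for element in soup.select('.usctitlechanged'):
    # Keep the text before the first period, then strip whitespace, newlines, and the trailing star marker
    titles.append(element.text.split('.')[0].strip().replace('\n','').replace(' ٭',''))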

In [8]:
titles

["Title 38 - Veterans' Benefits"]

## 3. List of Languages and Number of Articles on Wikipedia

In [9]:
# Fetch the webpage content
url = 'https://www.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

In [10]:
# Find language elements
languages = soup.find_all('a', class_='link-box')  

# Iterate over each language element and extract information
for language in languages:
    lang_name = language.find('strong').text.strip()
    num_articles = language.find('small').text.strip().replace('\xa0', ' ')
    print(f'{lang_name}: {num_articles}')

English: 6 744 000+ articles
Español: 1 906 000+ artículos
Русский: 1 947 000+ статей
日本語: 1 392 000+ 記事
Deutsch: 2 852 000+ Artikel
Français: 2 567 000+ articles
Italiano: 1 835 000+ voci
中文: 1 387 000+ 条目 / 條目
العربية: 1 221 000+ مقالة
Português: 1 113 000+ artigos
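
The article counts above are still strings such as `6 744 000+ articles`. If they need to be compared or sorted, they can be converted to integers. The following is a minimal sketch, assuming every label starts with digit groups separated by whitespace (regular or non-breaking); `parse_article_count` is a helper name introduced here, and the sample labels are copied from the output.

```python
import re


def parse_article_count(label):
    """Extract the leading number from a label like '6 744 000+ articles'."""
    # Match the leading run of digits and whitespace (\s also covers '\xa0')
    match = re.match(r'[\d\s]+', label)
    if match is None:
        return None
    digits = re.sub(r'\s', '', match.group())
    return int(digits) if digits else None


# Labels copied from the output above
labels = {
    'English': '6 744 000+ articles',
    'Español': '1 906 000+ artículos',
    '日本語': '1 392 000+ 記事',
}

counts = {name: parse_article_count(label) for name, label in labels.items()}
print(counts)
```

Because `\s` also matches the non-breaking space, the helper works on the raw text as well, before the `replace('\xa0', ' ')` step.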


## 4. List of Datasets Available on data.gov.uk

In [11]:
# Fetch the webpage content
url = 'https://data.gov.uk/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

In [12]:
# Create an empty list to store the dataset categories
dataset_categories = []

# Look for the HTML elements that contain the categories
# They seem to be in <a> tags with a class that includes 'govuk-link'
category_elements = soup.find_all("a", class_='govuk-link')

# Iterate through each <a> tag
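for element in category_elements:
    # Check if the closest enclosing <div> has class 'govuk-grid-column-full'
    # (guarding against links with no parent <div> or a <div> without a class)
    parent_div = element.find_parent("div")
    if parent_div is not None and "govuk-grid-column-full" in parent_div.get("class", []):
        # Extract the text from the <a> tag
        category_name = element.get_text().strip()
        # Keep only non-empty names containing the letter "a"
        # (note: this heuristic also drops any category whose name has no "a")
        if category_name and "a" in category_name:
            # Add the category name to our list
            dataset_categories.append(category_name)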

# Print the list of dataset categories
print(dataset_categories)

['Business and economy', 'Crime and justice', 'Education', 'Health', 'Mapping', 'Towns and cities', 'Transport', 'Digital service performance', 'Government reference data']
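
The `"a" in category_name` test is a plain substring check on the link text, so what it keeps depends only on how each name is spelled. The small sketch below shows its effect on made-up link texts. `sample_names` is hypothetical data, not scraped from data.gov.uk, and `keep_name` is a helper name introduced here.

```python
def keep_name(name):
    """Mirror the notebook's condition: non-empty text containing the letter 'a'."""
    return bool(name) and 'a' in name


# Hypothetical link texts, chosen to show both outcomes
sample_names = [
    'Business and economy',  # contains 'a', kept
    'Health',                # contains 'a', kept
    'Society',               # no 'a', dropped even though it reads like a category
    'Sign in',               # no 'a', dropped
    '',                      # empty text, dropped
]

kept = [name for name in sample_names if keep_name(name)]
dropped = [name for name in sample_names if not keep_name(name)]
print('kept:', kept)
print('dropped:', dropped)
```

If the page contains category names without an "a", a filter based on the link's position in the page or on its `href` would be a more reliable choice than one based on spelling.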


## 5. Top 10 Languages by Number of Native Speakers in a DataFrame

In [13]:
# Fetch the webpage content
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

In [14]:
# Based on the table containing the data, it has the class 'wikitable sortable static-row-numbers'
languages_table = soup.find('table', {'class': 'wikitable sortable static-row-numbers'})

# Extract the rows from the table
rows = languages_table.find_all('tr')

# The first row is the header, the rest are the data rows
header = [th.text.strip() for th in rows[0].find_all('th')]
data_rows = rows[1:11]  # We want the top 10 entries

# Extract data from each row
data = []
for row in data_rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)  # Add the cleaned column data to the data list

# Create a DataFrame with the extracted data
df = pd.DataFrame(data, columns=header)

# Display the DataFrame
df

| | Language | Native speakers(millions) | Language family | Branch |
|---|---|---|---|---|
| 0 | Mandarin Chinese | 939.0 | Sino-Tibetan | Sinitic |
| 1 | Spanish | 485.0 | Indo-European | Romance |
| 2 | English | 380.0 | Indo-European | Germanic |
| 3 | Hindi | 345.0 | Indo-European | Indo-Aryan |
| 4 | Portuguese | 236.0 | Indo-European | Romance |
| 5 | Bengali | 234.0 | Indo-European | Indo-Aryan |
| 6 | Russian | 147.0 | Indo-European | Balto-Slavic |
| 7 | Japanese | 123.0 | Japonic | Japanese |
| 8 | Yue Chinese | 86.1 | Sino-Tibetan | Sinitic |
| 9 | Vietnamese | 85.0 | Austroasiatic | Vietic |
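
Cells extracted with `ele.text.strip()` are strings, so a numeric column such as `Native speakers(millions)` has to be converted before it can be sorted or used in arithmetic. The following is a minimal sketch, assuming the values may carry thousands separators or bracketed footnote markers. `to_float` is a helper name introduced here, and the sample strings are hypothetical.

```python
import re


def to_float(cell_text):
    """Convert scraped cell text such as '86.1' or '1,138[2]' to a float."""
    # Drop bracketed footnote markers like [2]
    cleaned = re.sub(r'\[[^\]]*\]', '', cell_text)
    # Drop thousands separators and surrounding whitespace
    cleaned = cleaned.replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        # Text that is not a number (for example an empty cell)
        return None


# Hypothetical cell texts
samples = ['939', '86.1', '1,138[2]', ' 485 ', 'n/a']
print([to_float(text) for text in samples])
```

With the DataFrame built above, the helper could be applied with something like `df['Native speakers(millions)'].map(to_float)`.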
