# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
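
The first two tips can be wrapped in a small helper. The sketch below is one possible approach and not part of the lab's starter code: `summarize_response` and `preview_chars` are hypothetical names, and the helper only assumes the object it receives has `status_code` and `text` attributes, as a `requests` response does.

```python
from types import SimpleNamespace


def summarize_response(response, preview_chars=200):
    """Return a one-line summary: status code plus the start of the body."""
    status = response.status_code
    # Collapse runs of whitespace so the preview fits on one line
    preview = " ".join(response.text.split())[:preview_chars]
    verdict = "OK" if status == 200 else "unexpected status"
    return f"{status} ({verdict}): {preview}"


# Stand-in object with the two attributes used above (hypothetical sample data)
fake = SimpleNamespace(status_code=200, text="<html>\n  <title>Demo</title>\n</html>")
print(summarize_response(fake))
```

With a real request you would pass the object returned by `requests.get(url)` instead of the stand-in.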

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [5]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the webpage
url = 'https://github.com/trending/developers'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Step 3: Extract the trending developers
    developers = soup.find_all('article', class_='Box-row')
    
    for dev in developers:
        # Extract developer name
        name_tag = dev.find('h1', class_='h3 lh-condensed')
        if name_tag:
            name = name_tag.text.strip()
        else:
            name = 'N/A'
        
        # Extract username
        username_tag = dev.find('p', class_='f4 text-normal mb-1')
        if username_tag:
            username = username_tag.text.strip()
        else:
            username = 'N/A'
        
        # Extract the repository they are known for
        repo_tag = dev.find('h1', class_='h4 lh-condensed')
        if repo_tag:
            repo = repo_tag.find('a').text.strip()
        else:
            repo = 'N/A'
        
        # Print the extracted information
        print(f"Name: {name}")
        print(f"Username: {username}")
        print(f"Popular Repository: {repo}")
        print('-' * 40)
else:
    print("Failed to retrieve the page")

Name: Pete
Username: epwalsh
Popular Repository: obsidian.nvim
----------------------------------------
Name: Syed Imtiyaz Ali
Username: SyedImtiyaz-1
Popular Repository: GetTechProjects
----------------------------------------
Name: Igor
Username: igorpecovnik
Popular Repository: N/A
----------------------------------------
Name: Emil Ernerfeldt
Username: emilk
Popular Repository: egui
----------------------------------------
Name: Sebastian Raschka
Username: rasbt
Popular Repository: LLMs-from-scratch
----------------------------------------
Name: Henrik Rydgård
Username: hrydgard
Popular Repository: ppsspp
----------------------------------------
Name: uncenter
Username: uncenter
Popular Repository: purr
----------------------------------------
Name: Andrew Gunnerson
Username: chenxiaolong
Popular Repository: avbroot
----------------------------------------
Name: Xuan Son Nguyen
Username: ngxson
Popular Repository: wllama
----------------------------------------
Name: mxsm
Username:

#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find the HTML tag and class names used for the developer names. You can do this with Chrome DevTools or by clicking 'Inspect' in any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the html elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to remove extra whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Store the clean names in a list. Hint: you may also use `.get_text()` instead of `.text` and pass in parameters that do some of the string manipulation for you (check the documentation).

5. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
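
Steps 2 to 4 can be tried offline before touching the live page. The sketch below runs on a made-up HTML fragment (the names and the markup are invented for illustration; the real page may use different tags and classes). It passes the class through `attrs` and collapses whitespace with `split()` and `join()`, since `get_text(strip=True)` only trims the ends of each text node.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates a list of developer cards
sample_html = """
<article class="Box-row">
  <h1 class="h3 lh-condensed">
    <a href="/ada">
      Ada   Lovelace
    </a>
  </h1>
</article>
<article class="Box-row">
  <h1 class="h3 lh-condensed"><a href="/grace">Grace Hopper</a></h1>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# attrs takes a dict of attribute names and the values to match
name_tags = soup.find_all("h1", attrs={"class": "h3 lh-condensed"})

# split() drops line breaks and repeated spaces; join() rebuilds a clean name
names = [" ".join(tag.get_text().split()) for tag in name_tags]
print(names)
```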

In [30]:
import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the webpage
url = 'https://github.com/trending/developers'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Step 3: Extract the developer names
    name_tags = soup.find_all('h1', class_='h3 lh-condensed')
    
    # Initialize an empty list to store the cleaned names
    developer_names = []
    
    # Loop through the elements found and clean the text
    for tag in name_tags:
        name = tag.get_text(strip=True)  # Extract and clean the text
        developer_names.append(name)  # Add the cleaned name to the list
    
    # Step 4: Print the list of names
    print(developer_names)
else:
    print("Failed to retrieve the page")

['Syed Imtiyaz Ali', 'Pete', 'uncenter', 'lllyasviel', 'Sebastian Raschka', 'Vik Paruchuri', 'sigoden', 'Shubham Singodiya', "John O'Reilly", 'dkhamsing', 'Igor', 'Lars Grammel', 'Jarred Sumner', 'Eric Traut', 'Tom Payne', 'Stijn de Gooijer', 'Kiril Videlov', 'Ruslan Konviser', 'Steven Arcangeli', 'Stephen Celis', 'Stan Girard', 'Hassan El Mghari', 'continue revolution', 'Ash', 'Yair Morgenstern']


#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [33]:
import requests
from bs4 import BeautifulSoup
# Step 1: Fetch the webpage
url = 'https://github.com/trending/python?since=daily'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Step 3: Extract the repository names
    repo_tags = soup.find_all('h2', class_='h3 lh-condensed')
    
    # Initialize an empty list to store the cleaned repository names
    repo_names = []
    
    # Loop through the elements found and clean the text
    for tag in repo_tags:
        repo_name = tag.find('a').get_text(strip=True).replace('\n', '').replace(' ', '')
        repo_names.append(repo_name)  # Add the cleaned name to the list
    
    # Step 4: Print the list of repository names
    print(repo_names)
else:
    print("Failed to retrieve the page")

['lllyasviel/Omost', 'onuratakan/gpt-computer-assistant', 'jianchang512/ChatTTS-ui', 'VikParuchuri/marker', 'ToonCrafter/ToonCrafter', 'VinciGit00/Scrapegraph-ai', 'kholia/OSX-KVM', 'isaac-sim/IsaacLab', 'PostHog/posthog', 'yt-dlp/yt-dlp', 'commaai/openpilot', 'donnemartin/system-design-primer', 'TMElyralab/Comfyui-MusePose', 'Shubhamsaboo/awesome-llm-apps', 'jianchang512/pyvideotrans', 'python-telegram-bot/python-telegram-bot', 'ReaVNaiL/New-Grad-2024', 'sherlock-project/sherlock', 'lanqian528/chat2api', 'entropy-research/Devon', 'scikit-learn/scikit-learn', 'huggingface/datatrove', 'fofr/cog-consistent-character', 'TheAlgorithms/Python', 'jxnl/instructor']


#### 2. Display all the image links from the Walt Disney Wikipedia page.
Hint: use `.get()` to access information inside tags. Check out the documentation.
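
As a rough illustration of the hint, the sketch below reads the `src` attribute with `.get()` on a made-up fragment (the image paths are invented). It also uses `urllib.parse.urljoin` from the standard library as one way to turn relative and protocol-relative paths into absolute URLs; this is an alternative to prefixing strings by hand.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Hypothetical fragment covering the common forms a src attribute can take
sample_html = """
<img src="//upload.wikimedia.org/a.png">
<img src="/static/images/b.svg">
<img src="https://example.org/c.jpg">
<img alt="no source attribute">
"""

soup = BeautifulSoup(sample_html, "html.parser")

links = []
for img in soup.find_all("img"):
    src = img.get("src")  # returns None instead of raising when src is missing
    if src:
        # urljoin resolves the path against the page it was found on
        links.append(urljoin(page_url, src))

print(links)
```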

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [34]:
import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the webpage
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Step 3: Locate all the image tags
    image_tags = soup.find_all('img')
    
    # Step 4: Extract the URLs of the image links
    image_links = []
    for img in image_tags:
        img_url = img.get('src')
        if img_url:
            # Check if the URL is relative or absolute
            if img_url.startswith('//'):
                img_url = 'https:' + img_url
            elif img_url.startswith('/'):
                img_url = 'https://en.wikipedia.org' + img_url
            image_links.append(img_url)
    
    # Step 5: Print the list of image links
    for link in image_links:
        print(link)
else:
    print("Failed to retrieve the page")

https://en.wikipedia.org/static/images/icons/wikipedia.png
https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-en.svg
https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-en.svg
https://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
https://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg


#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [37]:
import requests
from bs4 import BeautifulSoup
import re

# Step 1: Fetch the webpage
url = 'https://www.wikipedia.org/'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Step 3: Locate elements containing language names and article counts
    languages = soup.find_all('a', class_='link-box')
    
    # Initialize a list to store the language names and article counts
    language_data = []
    
    # Step 4: Extract and clean the data
    for lang in languages:
        language_name = lang.find('strong').get_text(strip=True)
        article_count_text = lang.find('small').get_text(strip=True)
        # Strip every non-digit character (commas, plus signs, trailing text) so only the number remains
        article_count_text = re.sub(r'[^\d]', '', article_count_text)
        article_count = int(article_count_text)
        language_data.append((language_name, article_count))
    
    # Step 5: Print the language names along with the number of related articles
    for lang, count in language_data:
        print(f"{lang}: {count} articles")
else:
    print("Failed to retrieve the page")

English: 6796000 articles
日本語: 1407000 articles
Русский: 1969000 articles
Español: 1938000 articles
Deutsch: 2891000 articles
Français: 2598000 articles
Italiano: 1853000 articles
中文: 1409000 articles
فارسی: 995000 articles
Português: 1120000 articles


#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the correct table you want to analyse, you can use a nested **for** loop to find the elements row by row (check out the 'td' and 'tr' tags). <br>An easier way is to use `pd.read_html()`; check out the documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).
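
A minimal sketch of the `pd.read_html()` route is shown below on a made-up table (the languages and figures are invented, and the column names are assumptions about what such a table might contain). It assumes an HTML parser backend that pandas supports, such as `lxml`, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table; the real page has more columns and rows
sample_html = """
<table class="wikitable">
  <tr><th>Language</th><th>Native speakers (millions)</th></tr>
  <tr><td>Alpha</td><td>120</td></tr>
  <tr><td>Beta</td><td>300</td></tr>
  <tr><td>Gamma</td><td>45</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(sample_html))
df = tables[0]

# Keep the two largest entries; on the real table this would be head(10)
top = (
    df.sort_values("Native speakers (millions)", ascending=False)
    .head(2)
    .reset_index(drop=True)
)
print(top)
```

When a page holds several tables, inspecting `len(tables)` and the columns of each one is a quick way to pick the right index.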

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [39]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Fetch the webpage
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Step 3: Locate the correct table
    table = soup.find('table', {'class': 'wikitable sortable'})
    
    # Step 4: Extract the table rows and columns
    rows = table.find_all('tr')
    
    # Initialize lists to store the data
    languages = []
    speakers = []
    
    # Step 5: Extract the data row by row
    for row in rows[1:]:  # Skip the header row
        cols = row.find_all('td')
        if len(cols) < 3:
            continue
        language = cols[1].get_text(strip=True)
        speaker_count = cols[2].get_text(strip=True).replace(',', '')
        
        # Check if speaker_count is numeric
        if speaker_count.isdigit():
            languages.append(language)
            speakers.append(int(speaker_count))
        
        # Stop after collecting top 10 valid entries
        if len(languages) >= 10:
            break
    
    # Step 6: Create a pandas DataFrame
    data = {'Language': languages, 'Number of Native Speakers': speakers}
    df = pd.DataFrame(data)
    
    # Display the DataFrame
    print(df)
else:
    print("Failed to retrieve the page")

Empty DataFrame
Columns: [Language, Number of Native Speakers]
Index: []
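
The empty DataFrame above suggests that the hard-coded column positions, or the `isdigit()` check, did not match the table as served. One possible way to make the loop less fragile is to look the columns up by header text and to accept decimals and footnote markers. The sketch below does this on a made-up table; the header wording, languages and figures are all invented for illustration.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table with a decimal value, a footnote marker and a bad cell
sample_html = """
<table class="wikitable sortable">
  <tr><th>Language</th><th>Native speakers (millions)</th><th>Family</th></tr>
  <tr><td>Alpha</td><td>1,200.5</td><td>X</td></tr>
  <tr><td>Beta</td><td>300[1]</td><td>Y</td></tr>
  <tr><td>Gamma</td><td>n/a</td><td>Z</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = soup.find("table").find_all("tr")

# Find the columns by header text instead of assuming fixed positions
headers = [c.get_text(strip=True).lower() for c in rows[0].find_all(["th", "td"])]
lang_idx = next(i for i, h in enumerate(headers) if "language" in h)
count_idx = next(i for i, h in enumerate(headers) if "speakers" in h)

records = []
for row in rows[1:]:
    cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
    if len(cells) <= max(lang_idx, count_idx):
        continue  # skip rows that are too short to hold both columns
    # Leading number with optional thousands separators and decimals
    match = re.match(r"[\d,]+(?:\.\d+)?", cells[count_idx])
    if match:
        records.append((cells[lang_idx], float(match.group().replace(",", ""))))

df = pd.DataFrame(records, columns=["Language", "Native speakers"])
print(df)
```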


#### 3. Display Metacritic top 24 Best TV Shows of all time (TV Show name, initial release date, metascore rating, film rating system and description) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the html?

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.metacritic.com/browse/tv/'

In [40]:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Fetch the webpage
url = 'https://www.metacritic.com/browse/tv/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Initialize lists to store the data
    tv_show_names = []
    release_dates = []
    metascores = []
    ratings = []
    descriptions = []
    
    # Step 3: Locate elements containing the required information
    tv_show_list = soup.find_all('td', class_='clamp-summary-wrap')
    
    # Step 4: Extract the data for the top 24 TV shows
    for tv_show in tv_show_list[:24]:
        # Extract TV show name
        name = tv_show.find('a', class_='title').get_text(strip=True)
        tv_show_names.append(name)
        
        # Extract initial release date
        release_date = tv_show.find('div', class_='clamp-details').find_all('span')[1].get_text(strip=True)
        release_dates.append(release_date)
        
        # Extract metascore rating
        metascore = tv_show.find('div', class_='metascore_w').get_text(strip=True)
        metascores.append(int(metascore))
        
        # Extract film rating system
        rating = tv_show.find('span', class_='label').get_text(strip=True)
        ratings.append(rating)
        
        # Extract description
        description = tv_show.find('div', class_='summary').get_text(strip=True)
        descriptions.append(description)
    
    # Step 5: Create a pandas DataFrame
    data = {
        'TV Show Name': tv_show_names,
        'Initial Release Date': release_dates,
        'Metascore Rating': metascores,
        'Rating System': ratings,
        'Description': descriptions
    }
    df = pd.DataFrame(data)
    
    # Display the DataFrame
    print(df)
else:
    print("Failed to retrieve the page")

Empty DataFrame
Columns: [TV Show Name, Initial Release Date, Metascore Rating, Rating System, Description]
Index: []


#### 3.1. Find the image source link and the TV show link. Once you can retrieve them, add them to your initial dataframe.

In [41]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Fetch the webpage
url = 'https://www.metacritic.com/browse/tv/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Initialize lists to store the data
    tv_show_names = []
    release_dates = []
    metascores = []
    ratings = []
    descriptions = []
    image_links = []
    tv_show_links = []
    
    # Step 3: Locate elements containing the required information
    tv_show_list = soup.find_all('td', class_='clamp-summary-wrap')
    
    # Step 4: Extract the data for the top 24 TV shows
    for tv_show in tv_show_list[:24]:
        # Extract TV show name
        name = tv_show.find('a', class_='title').get_text(strip=True)
        tv_show_names.append(name)
        
        # Extract initial release date
        release_date = tv_show.find('div', class_='clamp-details').find_all('span')[1].get_text(strip=True)
        release_dates.append(release_date)
        
        # Extract metascore rating
        metascore = tv_show.find('div', class_='metascore_w').get_text(strip=True)
        metascores.append(int(metascore))
        
        # Extract film rating system
        rating = tv_show.find('span', class_='label').get_text(strip=True)
        ratings.append(rating)
        
        # Extract description
        description = tv_show.find('div', class_='summary').get_text(strip=True)
        descriptions.append(description)
        
        # Extract image link
        image_tag = tv_show.find_previous_sibling('td').find('img')
        image_link = image_tag['src'] if image_tag else ''
        image_links.append(image_link)
        
        # Extract TV show link
        tv_show_link = 'https://www.metacritic.com' + tv_show.find('a', class_='title')['href']
        tv_show_links.append(tv_show_link)
    
    # Step 5: Create a pandas DataFrame
    data = {
        'TV Show Name': tv_show_names,
        'Initial Release Date': release_dates,
        'Metascore Rating': metascores,
        'Rating System': ratings,
        'Description': descriptions,
        'Image Link': image_links,
        'TV Show Link': tv_show_links
    }
    df = pd.DataFrame(data)
    
    # Display the DataFrame
    print(df)
else:
    print("Failed to retrieve the page")

Empty DataFrame
Columns: [TV Show Name, Initial Release Date, Metascore Rating, Rating System, Description, Image Link, TV Show Link]
Index: []
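
An empty result like the one above means `find_all` matched nothing, which usually points to class names that differ from the markup actually returned. In addition, the chained `.find(...).get_text()` calls would raise `AttributeError` as soon as one tag is missing. The sketch below shows one defensive pattern on a made-up fragment; `safe_text` is a hypothetical helper and the `card`, `title` and `summary` classes are invented for the example.

```python
from bs4 import BeautifulSoup


def safe_text(parent, name, class_name, default="N/A"):
    """Return the stripped text of the first matching child, or a default."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag else default


# Hypothetical fragment: the second card has no summary
sample_html = """
<div class="card"><a class="title">Show One</a><div class="summary">A drama.</div></div>
<div class="card"><a class="title">Show Two</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
cards = soup.find_all("div", class_="card")

# Printing the match count first makes a selector mismatch obvious
print(f"Matched {len(cards)} cards")

rows = [(safe_text(c, "a", "title"), safe_text(c, "div", "summary")) for c in cards]
print(rows)
```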


## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = f'https://api.weatherapi.com/v1/current.json?key=5a68dbd3fe6242678ac130253242505&q={city}&aqi=no'


In [None]:
# your code here
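
One way to approach this exercise is to separate the parsing from the request. The sketch below only handles the parsing step and runs on a hand-written payload. The field names (`location.name`, `current.temp_c`, `current.wind_kph`, `current.condition.text`) are an assumption about the JSON returned by the `api.weatherapi.com` URL in the cell above and should be checked against a real response; `summarize_weather` is a hypothetical helper and the sample values are invented.

```python
def summarize_weather(payload):
    """Pick the requested fields out of a decoded JSON payload."""
    current = payload.get("current", {})
    condition = current.get("condition", {})
    return {
        "city": payload.get("location", {}).get("name"),
        "temperature_c": current.get("temp_c"),
        "wind_kph": current.get("wind_kph"),
        "description": condition.get("text"),
    }


# Hand-written sample; the structure is assumed, the values are made up
sample_payload = {
    "location": {"name": "Lisbon"},
    "current": {
        "temp_c": 21.0,
        "wind_kph": 14.4,
        "condition": {"text": "Partly cloudy"},
    },
}

print(summarize_weather(sample_payload))
```

With a live request, the payload would come from `requests.get(url).json()`. Passing the city through the `params` argument of `requests.get` instead of formatting it into the URL lets `requests` handle names that contain spaces or accents.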

#### Find the book name, price and stock availability from the Books to Scrape website as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here
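
As a starting point, the sketch below parses a made-up fragment shaped the way the product cards on that site appear to be structured (an `article` with class `product_pod`, a title in the link's `title` attribute, and `price_color` and `availability` paragraphs). Treat that structure as an assumption and confirm it in DevTools; `parse_books` is a hypothetical helper and the books are invented.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment imitating two product cards
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/book-one_1/index.html" title="Book One: The Full Title">Book One: The ...</a></h3>
  <p class="price_color">£12.50</p>
  <p class="instock availability">
    <i class="icon-ok"></i>
    In stock
  </p>
</article>
<article class="product_pod">
  <h3><a href="catalogue/book-two_2/index.html" title="Book Two">Book Two</a></h3>
  <p class="price_color">£8.00</p>
  <p class="instock availability">
    <i class="icon-ok"></i>
    In stock
  </p>
</article>
"""


def parse_books(html):
    """Return one dict per product card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for pod in soup.find_all("article", class_="product_pod"):
        link = pod.find("h3").find("a")
        books.append({
            # The visible link text may be truncated, so prefer the title attribute
            "title": link.get("title") or link.get_text(strip=True),
            "price": pod.find("p", class_="price_color").get_text(strip=True),
            "availability": pod.find("p", class_="availability").get_text(strip=True),
        })
    return books


df = pd.DataFrame(parse_books(sample_html))
print(df)
```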

#### Display the first 100 books available on the homepage. Once again, collect each book's name, price and stock availability.

***Hint:*** The total number of displayed books per page is 20, but you can easily move to the next page by looping through the desired number of pages and adding it to the end of the url.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://books.toscrape.com/catalogue/page-'
# This is how you will loop through each page:
number_of_pages = 100 // 20  # 100 books at 20 books per page
each_page_urls = []
for n in range(1, number_of_pages+1):
    link = url+str(n)+".html"
    each_page_urls.append(link)
    
each_page_urls

In [None]:
# your code here
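
The page loop can be sketched without any network access by passing the download step in as a function. In the sketch below, `collect_titles`, `fetch` and `delay_seconds` are hypothetical names, the two fake pages are invented, and the `product_pod` markup is an assumption about the site's structure.

```python
import time

from bs4 import BeautifulSoup


def collect_titles(page_urls, fetch, delay_seconds=0.0):
    """Download each page with fetch(url) and gather the book titles."""
    titles = []
    for page_url in page_urls:
        soup = BeautifulSoup(fetch(page_url), "html.parser")
        for pod in soup.find_all("article", class_="product_pod"):
            titles.append(pod.find("h3").find("a").get("title"))
        # A short pause between pages keeps the load on the server low
        time.sleep(delay_seconds)
    return titles


# Fake pages keyed by URL so the loop can be exercised offline
fake_pages = {
    "page-1.html": '<article class="product_pod"><h3><a title="A">A</a></h3></article>',
    "page-2.html": '<article class="product_pod"><h3><a title="B">B</a></h3></article>',
}

titles = collect_titles(list(fake_pages), fake_pages.get)
print(titles)
```

Against the real site, `fetch` could be `lambda u: requests.get(u, timeout=10).text`, `page_urls` could be the `each_page_urls` list built above, and a non-zero `delay_seconds` would space out the requests.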