# Web Scraping Lab

This notebook contains web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, class names, and other attributes used in the content you are expected to extract.
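
The first two tips can be folded into one small helper. This is a minimal sketch: `summarize_response` is a hypothetical function (not part of `requests`) that takes the values you would read from `response.status_code` and `response.text`, so it can be tried without a network connection.

```python
from http import HTTPStatus


def summarize_response(status_code: int, text: str, preview_chars: int = 80) -> str:
    """Return a one-line summary: status code, reason phrase, and a short body preview."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Non-standard status codes are not in the HTTPStatus enum.
        phrase = "Unknown"
    # Collapse whitespace so the preview fits on one line.
    preview = " ".join(text.split())[:preview_chars]
    return f"{status_code} {phrase}: {preview}"


# With a real request you would pass response.status_code and response.text.
summary = summarize_response(200, "<html>\n  <title>Example</title>\n</html>")
print(summary)
```

Looking at the preview before writing any parsing code makes it easier to spot an error page or a consent wall that was returned with a 200 status.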

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup`, `pandas`, `time` and `re` are already imported for you. Feel free to import additional libraries if you prefer.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [13]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

response = requests.get(url)

In [14]:
response.status_code

200

In [4]:
# your code here

#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools or by clicking 'Inspect' in any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the html elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names. Hint: you may also use `.get_text()` instead of `.text` and pass in the desired parameters to do some string manipulation (check the documentation).

5. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
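
The steps above can be sketched on a small inline document. The markup and the class name `dev-name` are hypothetical and chosen for illustration; the real GitHub page uses different class names that change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with messy whitespace inside the elements.
sample_html = """
<article class="dev-card">
  <h1 class="dev-name"><a href="/ada">
      Ada   Lovelace
  </a></h1>
</article>
<article class="dev-card">
  <h1 class="dev-name"><a href="/linus">Linus
  Torvalds</a></h1>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Step 2: select the elements by tag name and class.
headings = soup.find_all("h1", attrs={"class": "dev-name"})

# Steps 3 and 4: take each element's text and collapse whitespace and line breaks.
clean_names = [" ".join(h.get_text().split()) for h in headings]

# Step 5: print the list.
print(clean_names)
```

Splitting on whitespace and re-joining with a single space handles spaces, tabs and `\n` in one pass, which is often simpler than chaining several `replace()` calls.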

In [20]:
# your code here
if response.status_code == 200:
    # Parse once, naming the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(response.content, 'html.parser')
    developers = soup.find_all('article', attrs={'class': 'Box-row d-flex'})

    devs_list = []

    for line in developers:
        dev_name = line.find('h1', attrs={'class': 'h3 lh-condensed'}).find('a').get_text(strip=True)

        if line.find('p', attrs={'class': 'f4 text-normal mb-1'}) is not None:
            dev_login = line.find('p', attrs={'class': 'f4 text-normal mb-1'}).find('a').get_text(strip=True)
        else:
            dev_login = ''
        
        if dev_login:
            devs_list.append(f"{dev_login} ({dev_name})")
        else:
            devs_list.append(dev_name)

    # Print the list
    print(devs_list)
else:
    print(f"Error: {response.status_code}")


    

['martinvonz (Martin von Zweigbergk)', 'tangly1024 (tangly1024)', 'doronz88', 'tmc (Travis Cline)', 'stephencelis (Stephen Celis)', 'rasbt (Sebastian Raschka)', 'sandy081 (Sandeep Somavarapu)', 'masci (Massimiliano Pippi)', 'karpathy (Andrej)', 'dessalines (Dessalines)', 'TomAFrench (Tom French)', 'dkhamsing', 'arvinxx (Arvin Xu)', 'Nashtare (Robin Salen)', "joreilly (John O'Reilly)", 'ogabrielluiz (Gabriel Luiz Freitas Almeida)', 'klieret (Kilian Lieret)', 'TodePond (Lu Wilson)', 'shibing624 (Ming Xu (徐明))', 'Zheaoli (Nadeshiko Manju)', 'DaniPopes', 'younesbelkada (Younes Belkada)', 'Schniz (Gal Schlezinger)', 'jeremydmiller (Jeremy D. Miller)', 'faddat (Jacob Gadikian)']


#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to extract repository names instead of developer names.

In [8]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [9]:
# your code here


response = requests.get(url)
response.status_code

200

In [11]:
soup = BeautifulSoup(response.content, 'html.parser')
repository = soup.find_all('article', attrs = {'class':'Box-row'})

for line in repository:
    rep_name = line.find('h2', attrs = {'class':'h3 lh-condensed'}).find('a')
    print(rep_name.get_text(strip = True))
    # print(rep_name)

hpcaitech /Open-Sora
argilla-io /argilla
LLaVA-VL /LLaVA-NeXT
TheAlgorithms /Python
Azure /azure-sdk-for-python
pytorch /vision
AUTOMATIC1111 /stable-diffusion-webui
521xueweihan /HelloGitHub
vanna-ai /vanna
SkalskiP /courses
deepseek-ai /DeepSeek-Coder
ultralytics /ultralytics
tensorflow /models
Anjok07 /ultimatevocalremovergui
lm-sys /FastChat
danielmiessler /fabric
hpcaitech /ColossalAI


#### 2. Display all the image links from the Walt Disney Wikipedia page.
Hint: use `.get()` to access information inside tags. Check out the documentation.
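
As a rough illustration of the hint, the sketch below reads the `src` attribute with `.get()` and resolves relative links with the standard library's `urljoin`. The HTML fragment is made up for the example and only mimics the kinds of `src` values found on a Wikipedia page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Hypothetical fragment covering three cases: root-relative, protocol-relative, and no src.
sample_html = """
<img src="/static/images/icons/wikipedia.png">
<img src="//upload.wikimedia.org/example.png">
<img alt="no source here">
"""

soup = BeautifulSoup(sample_html, "html.parser")
image_urls = []
for img in soup.find_all("img"):
    src = img.get("src")  # returns None instead of raising when the attribute is missing
    if src:
        # urljoin resolves both "/path" and "//host/path" against the page URL.
        image_urls.append(urljoin(page_url, src))

print(image_urls)
```

Using `.get("src")` rather than `img["src"]` avoids a `KeyError` on tags that lack the attribute.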

In [12]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [13]:
# your code here

response = requests.get(url)
response.status_code

200

In [33]:
soup = BeautifulSoup(response.content, 'html.parser')


In [15]:
image_tags = soup.find_all('img')

image_tags

[<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>,
 <img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>,
 <img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;" width="117"/>,
 <img alt="Featured article" class="mw-file-element" data-file-height="443" data-file-width="466" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>,
 <img alt="Extended-protected article" class="mw-file-element" data-file-height="512" data-file

In [16]:
image_urls = []
base_url = 'https://en.wikipedia.org'

for img in image_tags:
    img_url = img.get('src')
    if img_url:
        # Convert relative URLs to absolute URLs
        if img_url.startswith('//'):
            img_url = 'https:' + img_url
        elif img_url.startswith('/'):
            img_url = base_url + img_url
        image_urls.append(img_url)

# Print all extracted image URLs
for image_url in image_urls:
    print(image_url)


https://en.wikipedia.org/static/images/icons/wikipedia.png
https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-en.svg
https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-en.svg
https://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
https://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg


#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.

In [17]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [18]:
# your code here
response = requests.get(url)
response.status_code

200

In [19]:
soup = BeautifulSoup(response.content, 'html.parser')

In [20]:
language_sections = soup.find_all('nav', class_='central-featured')

language_sections

[<nav aria-label="Top languages" class="central-featured" data-el-section="primary links" data-jsl10n="top-ten-nav-label">
 <!-- #1. en.wikipedia.org - 1,661,382,000 views/day -->
 <div class="central-featured-lang lang1" dir="ltr" lang="en">
 <a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
 <strong>English</strong>
 <small>6,836,000+ <span>articles</span></small>
 </a>
 </div>
 <!-- #2. ja.wikipedia.org - 202,566,000 views/day -->
 <div class="central-featured-lang lang2" dir="ltr" lang="ja">
 <a class="link-box" data-slogan="フリー百科事典" href="//ja.wikipedia.org/" id="js-link-box-ja" title="Nihongo — ウィキペディア — フリー百科事典">
 <strong>日本語</strong>
 <small>1,419,000+ <span>記事</span></small>
 </a>
 </div>
 <!-- #3. de.wikipedia.org - 184,494,000 views/day -->
 <div class="central-featured-lang lang3" dir="ltr" lang="de">
 <a class="link-box" data-slogan="Die freie Enzyklopädie" href="//de.

In [21]:
for line in language_sections[0].find_all(['strong','small']):
  print(line.get_text(strip = True))

English
6,836,000+articles
日本語
1,419,000+記事
Deutsch
2.919.000+Artikel
Русский
1 984 000+статей
Español
1.960.000+artículos
Français
2 618 000+articles
中文
1,425,000+条目 / 條目
Italiano
1.868.000+voci
فارسی
۱٬۰۰۵٬۰۰۰+مقاله
Português
1.126.000+artigos
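
The counts printed above use different thousands separators and, in one case, non-Latin digits. If you want them as numbers, one possible approach is to keep only the digit characters. `parse_article_count` is a hypothetical helper written for this sketch.

```python
import re


def parse_article_count(label: str) -> int:
    """Keep only the digits of a label such as '2.919.000+Artikel' and convert them to int."""
    # \D matches every non-digit character: separators, '+', spaces and the trailing word.
    digits = re.sub(r"\D", "", label)
    # int() also accepts Unicode decimal digits such as the Persian ones.
    return int(digits)


labels = ["6,836,000+articles", "2.919.000+Artikel", "1 984 000+статей", "۱٬۰۰۵٬۰۰۰+مقاله"]
counts = [parse_article_count(label) for label in labels]
print(counts)
```

This relies on the labels containing a single number each; a label with two separate numbers would have its digits merged.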


#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the table you want to analyse, you can use a nested **for** loop to read its elements row by row (check out the `tr` and `td` tags). <br>An easier way is to use `pd.read_html()`; check out the documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).
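
A minimal sketch of the `pd.read_html()` route, run on a small made-up table instead of the live page. It assumes an HTML parser backend for pandas (for example `lxml`) is installed; the column names and values are illustrative only.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row table; the real Wikipedia table has more rows and columns.
sample_html = """
<table>
  <tr><th>Language</th><th>Native speakers (in millions)</th></tr>
  <tr><td>English</td><td>380</td></tr>
  <tr><td>Mandarin Chinese</td><td>941</td></tr>
  <tr><td>Spanish</td><td>486</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(sample_html))
languages = tables[0]

# Sort by the numeric column and keep the largest entries (2 here, 10 in the exercise).
top = languages.nlargest(2, "Native speakers (in millions)").reset_index(drop=True)
print(top)
```

On a real page `read_html` usually returns several tables, so it helps to print `len(tables)` and inspect a few of them before picking an index.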

In [41]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [42]:
# your code here
response = requests.get(url)
response.status_code


200

In [52]:
soup = BeautifulSoup(response.content, 'html.parser')

lang_table = soup.find('tbody')

In [55]:
print(lang_table)

<tbody><tr>
<th>Language
</th>
<th data-sort-type="number">Native speakers<br/><small>(in millions)</small>
</th>
<th>Language family
</th>
<th>Branch
</th></tr>
<tr>
<td><a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a>
</td>
<td>941
</td>
<td><a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>
</td>
<td><a href="/wiki/Sinitic_languages" title="Sinitic languages">Sinitic</a>
</td></tr>
<tr>
<td><a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>
</td>
<td>486
</td>
<td><a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
</td>
<td><a href="/wiki/Romance_languages" title="Romance languages">Romance</a>
</td></tr>
<tr>
<td><a class="mw-redirect" href="/wiki/ISO_639:eng" title="ISO 639:eng">English</a>
</td>
<td>380
</td>
<td><a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
</td>
<td><a href="/wiki/Germanic_lan

In [58]:
def get_text(html_code):
    rows_data = []
    
    headers = [header.get_text(strip=True) for header in html_code.find_all('th')]
    
 
    for row in html_code.find_all('tr')[1:]:  
        row_data = {}
        
       
        cells = row.find_all('td')
        for i, cell in enumerate(cells):
           
            row_data[headers[i]] = cell.get_text(strip=True)

        if row_data:
            rows_data.append(row_data)
    
    return rows_data

    

In [64]:
df = pd.DataFrame(get_text(lang_table))
# Top 10 rows (the table is already ordered by number of native speakers)
df.head(10)

| | Language | Native speakers(in millions) | Language family | Branch |
|---|---|---|---|---|
| 0 | Mandarin Chinese | 941 | Sino-Tibetan | Sinitic |
| 1 | Spanish | 486 | Indo-European | Romance |
| 2 | English | 380 | Indo-European | Germanic |
| 3 | Hindi | 345 | Indo-European | Indo-Aryan |
| 4 | Bengali | 237 | Indo-European | Indo-Aryan |
| 5 | Portuguese | 236 | Indo-European | Romance |
| 6 | Russian | 148 | Indo-European | Balto-Slavic |
| 7 | Japanese | 123 | Japonic | Japanese |
| 8 | Yue Chinese | 86 | Sino-Tibetan | Sinitic |
| 9 | Vietnamese | 85 | Austroasiatic | Vietic |


#### 3. Display Metacritic's top 24 best TV shows of all time (TV show name, initial release date, Metascore rating, film rating system and description) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the html?

In [3]:
# This is the url you will scrape in this exercise 
url = 'https://www.metacritic.com/browse/tv/'

In [12]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

In [14]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [16]:
driver.get('https://www.metacritic.com/browse/tv/')

In [17]:
#close cookie window
cookie_icon = driver.find_element(By.ID, "onetrust-accept-btn-handler")

cookie_icon.click()
time.sleep(2)

In [18]:
html = driver.page_source

In [19]:
html



In [20]:
soup = BeautifulSoup(html, 'html.parser')
main = soup.find('div', class_='c-productListings')

In [21]:
main

<div class="c-productListings" section="detailed|1"><div class="c-productListings_grid g-grid-container u-grid-columns g-inner-spacing-bottom-large"><div class="c-finderProductCard"><a class="c-finderProductCard_container g-color-gray80 u-grid" href="/tv/planet-earth-blue-planet-ii/"><div class="c-finderProductCard_images g-outer-spacing-right-medium"><div class="c-finderProductCard_leftContainer"><div class="c-finderProductCard_imgContainer g-height-100 g-container-rounded-medium"><!-- --> <div class="c-finderProductCard_img g-height-100 g-width-100"><picture class="c-cmsImage c-cmsImage-loaded"> <img height="132" src="https://www.metacritic.com/a/img/resize/35590ad56d77635bd062aacb028fb5cdce98df3e/catalog/provider/2/2/2-bda5bbc8e07f93f9e60353b98cd2e67a.jpg?auto=webp&amp;fit=cover&amp;height=132&amp;width=88" style="" width="88"/> <div aria-hidden="true" class="c-globalImagePlaceholder c-cmsImage c-cmsImage-vertical" style="display: none;"><div class="c-globalImagePlaceholder--vertica

In [22]:
product_cards = main.find_all('div', class_='c-finderProductCard')


In [25]:
data = []

for card in product_cards:
    try:
        title_heading = card.find('h3', class_='c-finderProductCard_titleHeading').get_text(strip=True)
        # Drop the leading rank, e.g. "1. "
        show_name = re.sub(r'^\d+\.\s*', '', title_heading)
    except AttributeError:
        show_name = None
    
    try:
        meta_info = card.find('div', class_='c-finderProductCard_meta').find_all('span')
        release_date = meta_info[0].get_text(strip=True)
        # get_text(strip=True) yields e.g. "RatedTV-G" with no space, so remove the prefix with a regex
        rating_system = re.sub(r'^Rated\s*', '', meta_info[2].get_text(strip=True))
    except (IndexError, AttributeError):
        release_date = None
        rating_system = None
    
    try:
        description = card.find('div', class_='c-finderProductCard_description').get_text(strip=True)
    except AttributeError:
        description = None
    
    try:
        metascore = card.find('div', class_='c-siteReviewScore').get_text(strip=True)
    except AttributeError:
        metascore = None

    # Append extracted data to the list
    data.append([show_name, release_date, rating_system, metascore, description])

# Create a DataFrame from the extracted data
df = pd.DataFrame(data, columns=['TV Show Name', 'Initial Release Date', 'Film Rating System', 'Metascore Rating', 'Description'])

In [26]:
display(df)

| | TV Show Name | Initial Release Date | Film Rating System | Metascore Rating | Description |
|---|---|---|---|---|---|
| 0 | Planet Earth: Blue Planet II | Jan 20, 2018 | TV-G | 97 | Airing simultaneously on AMC, BBC America, IFC... |
| 1 | The Office (UK) | Jan 23, 2003 | TV-MA | 97 | "Trust, encouragement, reward, loyalty... sati... |
| 2 | America to Me | Aug 26, 2018 | TV-14 | 96 | The 10-part documentary series from Steve Jame... |
| 3 | O.J.: Made in America | May 20, 2016 | TV-MA | 96 | The seven-and-a-half-hour documentary chronicl... |
| 4 | Bo Burnham: Inside | May 30, 2021 | TV-MA | 96 | The musical comedy special was filmed by the c... |
| 5 | The U.S. and the Holocaust | Sep 18, 2022 | TV-14 | 96 | Narrated by Peter Coyote, the three-part docum... |
| 6 | Planet Earth II | Feb 18, 2017 | TV-G | 96 | Narrated by David Attenborough, the sequel to ... |
| 7 | The Staircase | 2004 | TV-MA | 95 | An 8-part documentary series about the celebra... |
| 8 | The Larry Sanders Show | Aug 15, 1992 | TV-MA | 95 | Comic Garry Shandling draws upon his own talk ... |
| 9 | Homicide: Life on the Street | Jan 31, 1993 | TV-14 | 95 | This series was the most reality-based police ... |
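
The repeated `try`/`except AttributeError` blocks in the extraction loop could be replaced by one small helper. This is only a sketch: `text_or_none` is a hypothetical name, and the sample markup uses invented class names, not Metacritic's.

```python
from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag


def text_or_none(parent: Tag, name: str, class_name: str) -> Optional[str]:
    """Return the stripped text of the first matching child, or None if it is absent."""
    element = parent.find(name, class_=class_name)
    return element.get_text(strip=True) if element is not None else None


# Hypothetical cards: the second one has no description.
sample_html = """
<div class="card"><h3 class="title">1. First Show</h3><div class="desc">A documentary.</div></div>
<div class="card"><h3 class="title">2. Second Show</h3></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = [
    (text_or_none(card, "h3", "title"), text_or_none(card, "div", "desc"))
    for card in soup.find_all("div", class_="card")
]
print(rows)
```

Checking for `None` explicitly keeps a missing field from hiding unrelated errors, which a broad `except` can do.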


In [27]:
# search for directors;

movie_icon = driver.find_element(By.CLASS_NAME, "c-finderProductCard")

movie_icon.click()
time.sleep(2)


In [28]:
cast_html = driver.page_source


In [29]:
cast_html



Comment for Sandra: Hi Sandra, if you are reading this, it seems I got lost at some point in this lab =( I didn't manage to get the 'where to watch' links and spent weeks on it. It seems I overcomplicated this task, so I will be glad to hear any advice or hints from you.

In [33]:
def extract_cast(html_content):
    # Parse the HTML content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract cast names, skipping cards that have no name heading
    cast_names = []
    for card in soup.find_all('div', class_='c-globalPersonCard'):
        name_tag = card.find('h3', class_='c-globalPersonCard_name')
        if name_tag is not None:
            cast_names.append(name_tag.get_text(strip=True))
    return cast_names


In [34]:
extract_cast(cast_html)

['David Attenborough', 'Peter Drost', 'Roger Munns', 'Roger Horrocks']

In [36]:
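def extract_img_url(page_soup):
    # Takes an already parsed page so the function does not depend on a global `soup`
    image_urls = []
    image_card = page_soup.find('picture', class_='c-cmsImage c-cmsImage-loaded')
    if image_card:
        image_url = image_card.find('img').get('src')
        image_urls.append(image_url)
    return image_urls

    

In [37]:
# `soup` is the parsed listing page from the earlier cell
extract_img_url(soup)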

['https://www.metacritic.com/a/img/resize/35590ad56d77635bd062aacb028fb5cdce98df3e/catalog/provider/2/2/2-bda5bbc8e07f93f9e60353b98cd2e67a.jpg?auto=webp&fit=cover&height=132&width=88']

In [None]:

def extract_data(driver):
    try:
        watch_icon = driver.find_element(By.CSS_SELECTOR, 'div[data-cy="w2w-all-button"] button')
        watch_icon.click()
        time.sleep(2)

        # Get the updated page source after clicking the button
        content = driver.page_source
        soup = BeautifulSoup(content, 'html.parser')

        # Extract where to watch links
        where_watch = soup.find_all('div', class_='c-w2wModal_item')
        watch_links = set()
        for item in where_watch:
            a_tags = item.find_all('a')
            for a in a_tags:
                href = a.get('href')
                if href:
                    # Sets have no append(); add() inserts and de-duplicates
                    watch_links.add(href)
    except Exception as e:
        print(f"An error occurred while collecting the watch links: {e}")
        watch_links = set()

    return list(watch_links)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get('https://www.metacritic.com/browse/tv/')
time.sleep(2)

# Accept cookies
try:
    cookie_icon = driver.find_element(By.ID, "onetrust-accept-btn-handler")
    cookie_icon.click()
    time.sleep(2)
except Exception as e:
    print(f"Cookie consent button not found or not clickable: {e}")


movie_icon = driver.find_element(By.CLASS_NAME, "c-finderProductCard")
movie_icon.click()
time.sleep(2)

#### 3.1. Find the image source link and the TV show link. Once you are able to retrieve them, add them to your initial dataframe.

In [None]:
# your code here
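
One possible shape for this step, shown on simplified made-up markup. The class name `card`, the example URLs and the column names `Show Link` and `Image Source` are assumptions for the sketch; the live page nests these elements more deeply.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

base_url = "https://www.metacritic.com"

# Hypothetical, simplified cards with one link and one image each.
sample_html = """
<div class="card"><a href="/tv/show-one/"><img src="https://example.com/one.jpg"></a></div>
<div class="card"><a href="/tv/show-two/"><img src="https://example.com/two.jpg"></a></div>
"""

# Stands in for the dataframe built in the previous exercise.
df = pd.DataFrame({"TV Show Name": ["Show One", "Show Two"]})

soup = BeautifulSoup(sample_html, "html.parser")
cards = soup.find_all("div", class_="card")

# One link and one image per card, in the same order as the dataframe rows.
df["Show Link"] = [urljoin(base_url, card.find("a").get("href")) for card in cards]
df["Image Source"] = [card.find("img").get("src") for card in cards]
print(df)
```

Assigning lists as new columns only works when the cards and the dataframe rows are in the same order and have the same length, so it is safest to build both from the same list of cards.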

## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
# The lab suggests https://openweathermap.org/current; the URL below calls the WeatherAPI.com current-weather endpoint instead
city = input('Enter the city: ')
url = f'https://api.weatherapi.com/v1/current.json?key=5a68dbd3fe6242678ac130253242505&q={city}&aqi=no'


In [None]:
# your code here
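
A sketch of how the JSON answer could be unpacked, using a made-up payload so it runs offline. The field names (`temp_c`, `wind_kph`, `condition.text`) are assumed from the WeatherAPI.com current-weather format and should be checked against a real response.

```python
import json
from urllib.parse import quote

city = "São Paulo"
# URL-encode the city so spaces and accents are safe inside a query string.
encoded_city = quote(city)
print(encoded_city)

# Hypothetical payload; with requests, response.json() would return the same structure.
sample_body = """
{"location": {"name": "Sao Paulo"},
 "current": {"temp_c": 21.5, "wind_kph": 12.3, "condition": {"text": "Partly cloudy"}}}
"""

data = json.loads(sample_body)
current = data["current"]
report = {
    "city": data["location"]["name"],
    "temperature_c": current["temp_c"],
    "wind_kph": current["wind_kph"],
    "description": current["condition"]["text"],
}
print(report)
```

Since an API returns structured JSON, no HTML parsing is needed here; dictionary lookups are enough.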

#### Find the book name, price and stock availability from the Books to Scrape website as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here
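
A minimal sketch of the extraction on an inline fragment. The tag and class names (`product_pod`, `price_color`, `availability`) are assumed to mirror the site's product cards and should be confirmed in DevTools; the book itself is invented.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like a single product card.
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/book-one/index.html" title="Book One: The Full Title">Book One: The ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")
records = []
for book in soup.find_all("article", class_="product_pod"):
    records.append({
        # The link text may be truncated; the title attribute holds the full name.
        "name": book.h3.a.get("title"),
        "price": float(book.find("p", class_="price_color").get_text(strip=True).lstrip("£")),
        "availability": book.find("p", class_="availability").get_text(strip=True),
    })

books = pd.DataFrame(records)
print(books)
```

Building a list of dictionaries first and creating the dataframe once at the end is usually simpler than appending rows to a dataframe inside the loop.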

#### Display the first 100 books listed on the website. Once again, collect the book name, price and stock availability.

***Hint:*** The total number of displayed books per page is 20, but you can easily move to the next page by looping through the desired number of pages and adding it to the end of the url.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://books.toscrape.com/catalogue/page-'
# This is how you will loop through each page:
number_of_pages = int(100/20)
each_page_urls = []
for n in range(1, number_of_pages+1):
    link = url+str(n)+".html"
    each_page_urls.append(link)
    
each_page_urls

In [None]:
# your code here