# Web Scraping Lab

This notebook contains a set of web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You will need to identify the HTML tags, class names, and other attributes that mark the content you are expected to extract.
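
The first two tips can be combined into one small check. The sketch below is only an illustration: `summarize_response` is a hypothetical helper, and the two stand-in objects replace real `requests` responses so that it runs without network access. It only reads the `status_code` and `text` attributes, which a `requests` response also exposes.

```python
from types import SimpleNamespace


def summarize_response(response, preview_chars=80):
    """Return (ok, preview) for any object exposing status_code and text."""
    ok = 200 <= response.status_code < 300
    preview = response.text[:preview_chars]
    return ok, preview


# Stand-in objects with made-up content (assumption: no network access here).
good = SimpleNamespace(status_code=200, text="<!DOCTYPE html><html>...</html>")
blocked = SimpleNamespace(status_code=403, text="Forbidden")

print(summarize_response(good))
print(summarize_response(blocked))
```

Printing a short preview rather than the whole page is usually enough to tell real content from an error page.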

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [6]:
# your code here

# Send a GET request to the URL
response = requests.get(url)

#Check the status code of the response
response.status_code

200

In [13]:
# Parse the content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Print the parsed HTML content
print(soup.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-f552bab6ce72.css" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-4589f64a2275.css" media="all" rel="stylesheet">
    <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://

#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. The names should not contain any HTML tags.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools or by clicking 'Inspect' in any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the html elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names. Hint: you may also use `.get_text()` instead of `.text` and pass in the desired parameters to do some string manipulation (check the documentation).

5. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
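
The whitespace clean-up in step 4 can be sketched with plain string methods. This is a minimal illustration, not the required solution: `clean_text` is a hypothetical helper and the raw strings are invented to resemble unprocessed element text.

```python
def clean_text(raw):
    """Collapse runs of spaces, tabs and line breaks into single spaces."""
    return " ".join(raw.split())


# Made-up raw strings, shaped like the text of an element before cleaning.
raw_names = ["\n\n      Meng Zhang\n ", "\n  Marcos   Cáceres\n"]
clean_names = [clean_text(raw) for raw in raw_names]
print(clean_names)
```

Calling `str.split()` with no arguments splits on any run of whitespace, so one pass removes leading, trailing and repeated whitespace.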

In [14]:
# your code here

# Find all the elements that contain the developer names
developer_elements = soup.find_all('h1', class_='h3 lh-condensed')

# Extract and clean the developer names

developer_names = []

for element in developer_elements:
    # Get the text content and clean it
    name = element.get_text(strip=True)
    developer_names.append(name)
    

print(developer_names) # OK but this doesn't look like the example
    


['Meng Zhang', 'Marcos Cáceres', 'Emil Ernerfeldt', 'Andrej', 'jdx', 'Yiming Cui', 'Vinicius Stock', 'Sebastian Raschka', 'Adrien Barbaresi', "Nick O'Leary", 'Edmund Hung', 'Marc Seitz', 'Syed Imtiyaz Ali', 'Gregory Haerr', 'Pavel Iakubovskii', 'Vik Paruchuri', 'astaxie', 'Jan Janssen', 'Miles Cranmer', 'John Sundell', 'yhirose', 'Parker Lougheed', 'Olivier Poitrey', 'Steve Hipwell', 'Clement Tsang']


In [23]:
# Find all the elements that contain the developer names and usernames
developer_elements = soup.find_all('article', class_='Box-row')

# Extract and clean the developer names and usernames
developers = []
for element in developer_elements:
    # Get the full name
    full_name_tag = element.find('h1', class_='h3 lh-condensed')
    if full_name_tag:
        full_name = full_name_tag.get_text(strip=True)
    else:
        full_name = None

    # Get the username
    username_tag = element.find('p', class_='f4 text-normal mb-1')
    if username_tag:
        username = username_tag.get_text(strip=True)
    else:
        username = None

    # Format as 'username (Full Name)' if both are present
    if full_name and username:
        formatted_name = f"{username} ({full_name})"
    elif full_name:  # Use full_name if username is not available
        formatted_name = full_name
    elif username:  # Use username if full_name is not available
        formatted_name = username
    else:
        formatted_name = None

    if formatted_name:
        developers.append(formatted_name)

# Print the list of developer names
print(developers)


['wsxiaoys (Meng Zhang)', 'marcoscaceres (Marcos Cáceres)', 'emilk (Emil Ernerfeldt)', 'karpathy (Andrej)', 'jdx', 'ymcui (Yiming Cui)', 'vinistock (Vinicius Stock)', 'rasbt (Sebastian Raschka)', 'adbar (Adrien Barbaresi)', "knolleary (Nick O'Leary)", 'edmundhung (Edmund Hung)', 'mfts (Marc Seitz)', 'SyedImtiyaz-1 (Syed Imtiyaz Ali)', 'ghaerr (Gregory Haerr)', 'qubvel (Pavel Iakubovskii)', 'VikParuchuri (Vik Paruchuri)', 'astaxie (astaxie)', 'jan-janssen (Jan Janssen)', 'MilesCranmer (Miles Cranmer)', 'JohnSundell (John Sundell)', 'yhirose', 'parlough (Parker Lougheed)', 'rs (Olivier Poitrey)', 'stevehipwell (Steve Hipwell)', 'ClementTsang (Clement Tsang)']


#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [27]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [28]:
# your code here
# Send a GET request to the URL
response = requests.get(url)

#Check the status code of the response
response.status_code

200

In [29]:
# Parse the content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Print the parsed HTML content
print(soup.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-f552bab6ce72.css" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-4589f64a2275.css" media="all" rel="stylesheet">
    <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://

In [30]:
# Find all the elements that contain the repository names
repository_elements = soup.find_all('h2', class_='h3 lh-condensed')

# Extract and clean the repository names 

repositories = []

for element in repository_elements:
    repository_name = element.get_text(strip=True)
    repositories.append(repository_name)
    
    
print(repositories)

['tencent-ailab /V-Express', 'danielmiessler /fabric', 'VikParuchuri /surya', 'mlflow /mlflow', 'TMElyralab /MusePose', 'truefoundry /cognita', 'pdm-project /pdm', 'Hillobar /Rope', 'mustafaaljadery /llama3v', 'infiniflow /ragflow', 'PostHog /posthog', 'goauthentik /authentik', 'vanna-ai /vanna', 'wandb /wandb', 'Shubhamsaboo /awesome-llm-apps', 'outlines-dev /outlines', 'Sinaptik-AI /pandas-ai', 'Marker-Inc-Korea /AutoRAG', 'firmai /financial-machine-learning', 'run-llama /llama_index', 'bin123apple /AutoCoder', 'PrefectHQ /prefect', 'VikParuchuri /marker', 'adysec /wechat_sqlite', 'RUC-NLPIR /FlashRAG']


In [33]:
# OK, but what if I wanted to get only the repository name, and not the username
repository_names = []

for element in repository_elements:
    full_repo_name = element.get_text(strip=True)
    repo_name = full_repo_name.split('/')[1]  # Get the part after '/'
    repository_names.append(repo_name)

# print list of repository names

print(repository_names)

['V-Express', 'fabric', 'surya', 'mlflow', 'MusePose', 'cognita', 'pdm', 'Rope', 'llama3v', 'ragflow', 'posthog', 'authentik', 'vanna', 'wandb', 'awesome-llm-apps', 'outlines', 'pandas-ai', 'AutoRAG', 'financial-machine-learning', 'llama_index', 'AutoCoder', 'prefect', 'marker', 'wechat_sqlite', 'FlashRAG']
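
Indexing `split('/')[1]` raises an `IndexError` if a heading ever lacks a slash. A slightly more defensive sketch uses `str.partition`, which always returns three parts. The helper name `split_repo` is hypothetical, and the sample strings are copied from the output above.

```python
def split_repo(full_name):
    """Split 'owner /repo' into (owner, repo); owner is None without a slash."""
    owner, sep, repo = full_name.partition('/')
    if not sep:
        # No slash found: treat the whole string as the repository name.
        return None, owner.strip()
    return owner.strip(), repo.strip()


scraped = ['tencent-ailab /V-Express', 'danielmiessler /fabric', 'mlflow /mlflow']
pairs = [split_repo(name) for name in scraped]
print(pairs)
```

Keeping the owner next to the repository name also makes it easy to rebuild the full `owner/repo` path later.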


#### 2. Display all the image links from the Walt Disney Wikipedia page.
Hint: use `.get()` to access information inside tags. Check out the documentation.

In [36]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [37]:
# your code here
# Send a GET request to the URL

response = requests.get(url)

# Check status code

response.status_code

200

In [38]:
# Parse the content with BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Find image elements

image_elements = soup.find_all('img')

# Extract and clean image links

image_links = []
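for element in image_elements:
    img_src = element.get('src')

    # Skip <img> tags without a src attribute (.get() returns None for them)
    if img_src is None:
        continue

    # Prepend the protocol or the site root to relative links
    if img_src.startswith('//'):
        img_src = 'https:' + img_src
    elif img_src.startswith('/'):
        img_src = 'https://en.wikipedia.org' + img_src

    image_links.append(img_src)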
    
# print the list of image links

print(image_links)

['https://en.wikipedia.org/static/images/icons/wikipedia.png', 'https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-en.svg', 'https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-en.svg', 'https://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png', 'https://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png', 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG', 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png', 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Dis

In [39]:
# OK so that returned literally all images, including the logo and all
# let's see if I wanted to extract only the images related to the Walt Disney Wikipedia page

# Find the main content div

content_div = soup.find('div',id='mw-content-text')

# Find all the image elements within the main content div

image_elements = content_div.find_all('img')

# Extract and clean the image links
image_links = []
for element in image_elements:
    img_src = element.get('src')
    # Skip <img> tags without a src attribute (.get() returns None for them)
    if img_src is None:
        continue
    # Prepend the protocol or the site root to relative links
    if img_src.startswith('//'):
        img_src = 'https:' + img_src
    elif img_src.startswith('/'):
        img_src = 'https://en.wikipedia.org' + img_src
    image_links.append(img_src)
    
# Print the list of image links
print(image_links)


['https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG', 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png', 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Disney_drawing_goofy.jpg/170px-Disney_drawing_goofy.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/WaltDisneyplansDisneylandDec1954.jpg/220px-WaltDisneyplansDisneylandDe
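
The manual prefix handling above could also be delegated to the standard library. The sketch below uses `urllib.parse.urljoin`, which resolves protocol-relative (`//host/...`) and root-relative (`/path`) links against a base URL. The helper name `absolute_url` is hypothetical, and the sample `src` values are taken from the outputs above.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://en.wikipedia.org/wiki/Walt_Disney'


def absolute_url(src, base=PAGE_URL):
    """Resolve a possibly relative src attribute against the page URL."""
    return urljoin(base, src)


samples = [
    '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
    '/static/images/icons/wikipedia.png',
    'https://en.wikipedia.org/static/images/icons/wikipedia.png',
]
for src in samples:
    print(absolute_url(src))
```

Links that are already absolute pass through unchanged, so the same call can be applied to every `src` value.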

#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.

In [41]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [42]:
# your code here
# Send a GET request to the URL

response = requests.get(url)

# Check status code

response.status_code

200

In [54]:
# Parse the content with BeautifulSoup and specify the correct encoding

soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')

# Find the top languages section

nav_top_languages = soup.find('nav', attrs={'aria-label':"Top languages"})

# Find all the elements that contain the language names and article counts
languages_elements = nav_top_languages.find_all('div', attrs={'class':'central-featured-lang'})

# Extract and clean the language names and article counts

languages = []

for element in languages_elements:
    language_name = element.find('strong').get_text(strip=True)
    article_count = element.find('small').get_text(strip=True)
    languages.append((language_name, article_count))
    
# print the list of languages and article counts

for language, article in languages:
    print(f"{language}: {article}")

English: 6,796,000+articles
日本語: 1,407,000+記事
Русский: 1 969 000+статей
Español: 1.938.000+artículos
Deutsch: 2.891.000+Artikel
Français: 2 598 000+articles
Italiano: 1.853.000+voci
中文: 1,409,000+条目 / 條目
فارسی: ۹۹۵٬۰۰۰+مقاله
Português: 1.120.000+artigos
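
The article counts above use a different thousands separator for each language (comma, period, space), so they cannot be passed to `int()` directly. One possible clean-up, sketched under the assumption that the number always precedes the `+` sign, keeps only the decimal digits. The helper name `parse_article_count` is hypothetical.

```python
def parse_article_count(text):
    """Turn a string such as '1.938.000+artículos' into the integer 1938000."""
    number_part = text.split('+', 1)[0]
    # Keep decimal digits only; separators such as ',', '.' and spaces are dropped.
    digits = ''.join(ch for ch in number_part if ch.isdecimal())
    return int(digits)  # raises ValueError if no digits were found


samples = ['6,796,000+articles', '1 969 000+статей', '1.938.000+artículos']
for sample in samples:
    print(parse_article_count(sample))
```

Both `str.isdecimal()` and `int()` accept decimal digits from other scripts, so the same approach should also cover counts written with non-Latin digits.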


#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the correct table you want to analyse, you can use a nested **for** loop to find the elements row by row (check out the 'td' and 'tr' tags). <br>An easier way to do it is using pd.read_html(), check out documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).

In [75]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [78]:
# your code here

# first using for loop

# Send a GET request to the URL

response = requests.get(url)

# Check status code: 

response.status_code  # 200

# Parse the content with BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')

# Find the correct table containing the data
tables = soup.find_all('table', {'class': 'wikitable'})
    
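# Identify the correct table by its caption text
target_table = None
for table in tables:
    if 'Languages with at least 50 million first-language speakers' in table.text:
        target_table = table
        break

# Stop early with a clear message if no table matched (e.g. the page layout changed)
if target_table is None:
    raise ValueError('Could not find the native-speakers table on the page')
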
# extract data row by row

data = []
rank = 1
for row in target_table.find_all('tr')[1:11]:  # Skip header row and limit to top 10
    columns = row.find_all('td')
    if columns:
        language = columns[0].get_text(strip=True)
        native_speakers = columns[1].get_text(strip=True)
        data.append([rank, language, native_speakers])
        rank += 1
        
#Create a DataFrame

top_languages = pd.DataFrame(data, columns=['Rank','Language','Native Speakers (millions)'])

top_languages

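|   | Rank | Language | Native Speakers (millions) |
|---|---|---|---|
| 0 | 1 | Mandarin Chinese | 941 |
| 1 | 2 | Spanish | 486 |
| 2 | 3 | English | 380 |
| 3 | 4 | Hindi | 345 |
| 4 | 5 | Bengali | 237 |
| 5 | 6 | Portuguese | 236 |
| 6 | 7 | Russian | 148 |
| 7 | 8 | Japanese | 123 |
| 8 | 9 | Yue Chinese | 86 |
| 9 | 10 | Vietnamese | 85 |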


#### 3. Display Metacritic top 24 Best TV Shows of all time (TV Show name, initial release date, metascore rating, film rating system and description) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the html?

In [79]:
# This is the url you will scrape in this exercise 
url = 'https://www.metacritic.com/browse/tv/'

In [80]:
# your code here
# first using for loop

# Send a GET request to the URL

response = requests.get(url)

# Check status code: 

response.status_code


403

In [81]:
#set the headers to "pretend" to be a web browser

headers =  {'Accept-Encoding':'gzip, deflate, br, zstd',
          'Accept-Language':'en,es-419;q=0.9,es;q=0.8,pt;q=0.7',
          'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'}



In [82]:
#Send a GET request to the URL
response = requests.get(url, headers = headers)

#Check the status code
response.status_code

200

In [89]:
response.content




In [104]:
# Parse the content with BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')

# Find the elements containing the TV show data
tv_show_elements = soup.find_all('div', class_='c-finderProductCard')
    
tv_shows_data = []
for element in tv_show_elements[:24]:  # Limit to top 24 shows
    # Extract TV show name
    tv_show_name_element = element.find('h3', class_='c-finderProductCard_titleHeading')
    tv_show_name = tv_show_name_element.get_text(strip=True) if tv_show_name_element else 'N/A'
    tv_show_name = ''.join(tv_show_name.split('.')[1:]).strip() # remove the number next to the show name

    # Collect the metadata spans once (release date and rating are both stored here)
    meta_element = element.find('div', class_='c-finderProductCard_meta')
    meta_spans = meta_element.find_all('span') if meta_element else []

    # Extract initial release date
    initial_release_date = meta_spans[0].get_text(strip=True) if meta_spans else 'N/A'

    # Extract metascore rating
    metascore_rating_element = element.find('div', class_='c-siteReviewScore')
    metascore_rating = metascore_rating_element.get_text(strip=True) if metascore_rating_element else 'N/A'

    # Extract the film rating system
    film_rating = meta_spans[2].get_text(strip=True) if len(meta_spans) > 2 else 'N/A'

    # Extract description
    description_element = element.find('div', class_='c-finderProductCard_description')
    description = description_element.get_text(strip=True) if description_element else 'N/A'

    tv_shows_data.append([tv_show_name, initial_release_date, metascore_rating, film_rating, description])



In [105]:
# Create a dataframe:

tv_shows_df = pd.DataFrame(tv_shows_data, columns=['tv_show_name', 'initial_release_date','metascore_rating', 'film_rating', 'description'])
tv_shows_df.index += 1  # Start the index from 1
tv_shows_df

|   | tv_show_name | initial_release_date | metascore_rating | film_rating | description |
|---|---|---|---|---|---|
| 1 | Planet Earth: Blue Planet II | Jan 20, 2018 | 97 | RatedTV-G | Airing simultaneously on AMC, BBC America, IFC... |
| 2 | The Office (UK) | Jan 23, 2003 | 97 | RatedTV-MA | "Trust, encouragement, reward, loyalty... sati... |
| 3 | America to Me | Aug 26, 2018 | 96 | RatedTV-14 | The 10-part documentary series from Steve Jame... |
| 4 | OJ: Made in America | May 20, 2016 | 96 | RatedTV-MA | The seven-and-a-half-hour documentary chronicl... |
| 5 | Bo Burnham: Inside | May 30, 2021 | 96 | RatedTV-MA | The musical comedy special was filmed by the c... |
| 6 | The US and the Holocaust | Sep 18, 2022 | 96 | RatedTV-14 | Narrated by Peter Coyote, the three-part docum... |
| 7 | Planet Earth II | Feb 18, 2017 | 96 | RatedTV-G | Narrated by David Attenborough, the sequel to ... |
| 8 | The Staircase | May 31, 2004 | 95 | RatedTV-MA | An 8-part documentary series about the celebra... |
| 9 | The Larry Sanders Show | Aug 15, 1992 | 95 | RatedTV-MA | Comic Garry Shandling draws upon his own talk ... |
| 10 | Homicide: Life on the Street | Jan 31, 1993 | 95 | RatedTV-14 | This series was the most reality-based police ... |
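
The title clean-up in the scraping loop splits on every period and joins the pieces back together, so any period inside a title is lost along with the rank number. The recorded output shows 'OJ: Made in America' and 'The US and the Holocaust', which suggests this happened here. A possible alternative, sketched below, removes only the leading rank with a regular expression. The raw heading text is an assumption (the notebook never prints it), and `strip_rank` and `strip_rated` are hypothetical helpers.

```python
import re

# Matches a leading rank such as "1." or "12. " (assumed heading format).
RANK_PREFIX = re.compile(r'^\s*\d+\.\s*')


def strip_rank(title):
    """Remove only the leading rank number, keeping periods inside the title."""
    return RANK_PREFIX.sub('', title, count=1).strip()


def strip_rated(label):
    """Drop a leading 'Rated' so 'RatedTV-G' becomes 'TV-G'."""
    prefix = 'Rated'
    return label[len(prefix):].strip() if label.startswith(prefix) else label


raw_title = '4. O.J.: Made in America'  # hypothetical raw heading text
print(''.join(raw_title.split('.')[1:]).strip())  # split-and-join approach
print(strip_rank(raw_title))                      # regex approach
print(strip_rated('RatedTV-MA'))
```

Because the pattern is anchored at the start of the string, a placeholder such as 'N/A' passes through unchanged instead of becoming an empty string.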


#### 3.1. Find the image source link and the TV show link. Once you have retrieved them, add them to your initial dataframe.

In [119]:
# your code here


# Build the absolute link to each TV show page


show_url_list = ["https://www.metacritic.com" + element.a['href'] for element in tv_show_elements[:24]]


print(show_url_list)



['https://www.metacritic.com/tv/planet-earth-blue-planet-ii/', 'https://www.metacritic.com/tv/the-office-uk/', 'https://www.metacritic.com/tv/america-to-me/', 'https://www.metacritic.com/tv/oj-made-in-america/', 'https://www.metacritic.com/tv/bo-burnham-inside/', 'https://www.metacritic.com/tv/the-us-and-the-holocaust/', 'https://www.metacritic.com/tv/planet-earth-ii/', 'https://www.metacritic.com/tv/the-staircase/', 'https://www.metacritic.com/tv/the-larry-sanders-show/', 'https://www.metacritic.com/tv/homicide-life-on-the-street/', 'https://www.metacritic.com/tv/the-sopranos/', 'https://www.metacritic.com/tv/city-so-real/', 'https://www.metacritic.com/tv/bleak-house/', 'https://www.metacritic.com/tv/homecoming-a-film-by-beyonce/', 'https://www.metacritic.com/tv/samurai-jack/', 'https://www.metacritic.com/tv/fleabag/', 'https://www.metacritic.com/tv/the-underground-railroad/', 'https://www.metacritic.com/tv/atlanta/', 'https://www.metacritic.com/tv/romeo-juliet/', 'https://www.metacri

In [164]:
# Find all image tags
img_tags = soup.find_all('picture',class_="c-cmsImage")

img_tags

[<picture class="c-cmsImage"><!-- --> <img height="132" src="" style="display:none;" width="88"/> <div aria-hidden="true" class="c-globalImagePlaceholder c-cmsImage c-cmsImage-vertical"><div class="c-globalImagePlaceholder--vertical o-ratio o-ratio-tall g-bg-gray80 g-container-rounded-medium"><div class="g-inner-spacing-small o-ratio_content u-flexbox u-flexbox-alignCenter u-flexbox-justifyCenter"><svg class="g-width-100" viewbox="0 0 176 40"><use class="g-fill-gray60" xlink:href="#logoWordmarkPlaceholder"></use></svg></div></div></div></picture>,
 <picture class="c-cmsImage"><!-- --> <img height="132" src="" style="display:none;" width="88"/> <div aria-hidden="true" class="c-globalImagePlaceholder c-cmsImage c-cmsImage-vertical"><div class="c-globalImagePlaceholder--vertical o-ratio o-ratio-tall g-bg-gray80 g-container-rounded-medium"><div class="g-inner-spacing-small o-ratio_content u-flexbox u-flexbox-alignCenter u-flexbox-justifyCenter"><svg class="g-width-100" viewbox="0 0 176 40"

In [166]:
# Note: Due to limitations with dynamically rendering JavaScript content on the Metacritic website
# using requests and BeautifulSoup, and challenges with setting up Selenium and ChromeDriver, 
# I will manually collect the image URLs for the top 24 TV shows. This approach ensures I get the
# correct data without encountering issues related to JavaScript rendering or driver compatibility.

# Manually collected image URLs
image_urls = [
    "https://www.metacritic.com/a/img/resize/35590ad56d77635bd062aacb028fb5cdce98df3e/catalog/provider/2/2/2-bda5bbc8e07f93f9e60353b98cd2e67a.jpg",
    "https://www.metacritic.com/a/img/resize/28a62562ee72bb8d509d9014e126ac79064c59da/catalog/provider/2/2/2-1fec3b983df97cc1748846b3a50306f2.jpg",
    "https://www.metacritic.com/a/img/resize/9d4e8a3aa52c901123f94381035fb236776f9430/catalog/provider/2/2/2-d5e0401444c377d71a3a015fd6bbe10d.jpg",
    "https://www.metacritic.com/a/img/resize/a5e2004fa36ebed445a7f47d9486da2b900a8fb9/catalog/provider/2/2/2-d51ed1c855925d51672d443b45ef6014.jpg",
    "https://www.metacritic.com/a/img/resize/214d3a987eec1ef092431ac2ce5fd82ececdd6d8/catalog/provider/2/2/2-b69395fa474f9dd2fc3ccd121b2700c6.jpg",
    "https://www.metacritic.com/a/img/resize/02efe7bbbc86d7ea9928b5654d41f5fc1062cc85/catalog/provider/2/2/2-0c07424ec26d0f487d45166f148f0be5.jpg",
    "https://www.metacritic.com/a/img/resize/a7e19a26384a48f70632f97586482b60ec7bf750/catalog/provider/2/2/2-125fd8138a3ab393e6a3b9cc23f77056.jpg",
    "https://www.metacritic.com/a/img/resize/93facf26db90310275351d1d464f873397279941/catalog/provider/2/2/2-fba2c50ee3e5767ad3f18dd878330eb1.jpg",
    "https://www.metacritic.com/a/img/resize/0163d433cc475ebd8cff86fb73b63e963034649a/catalog/provider/2/2/2-40879874a3b4b42b8625ae7da9e1a042.jpg",
    "https://www.metacritic.com/a/img/resize/efe12647a2dd36fad8f44b225e8465a547d279f7/catalog/provider/2/2/2-b30f7134847355ce7a6dbdfda13d1f22.jpg",
    "https://www.metacritic.com/a/img/resize/6852fe8b6b5ab1ffa87b59790acd11579a41557b/catalog/provider/2/2/2-74b31fc00e5d07e72b5b7522cab3613b.jpg",
    "https://www.metacritic.com/a/img/resize/eb07a7aae6f7774e386262772f354e84a33c0e62/catalog/provider/2/2/2-33a080b1e1b05c96e5783b1b4de6f7ad.jpg",
    "https://www.metacritic.com/a/img/resize/bb72b550e572f1f4d465f6a26c8ebd7dfbb39351/catalog/provider/2/2/2-fbe382d7c874739ad8a50ff28e91946a.jpg",
    "https://www.metacritic.com/a/img/resize/c70bd89b5861e8cdc7e8d4a786604c80e97a02ba/catalog/provider/2/2/2-e329c8f4b00f005c7251557ec45fe1cd.jpg",
    "https://www.metacritic.com/a/img/resize/aa3b95db1782df80377ada565a62c329c1f255bd/catalog/provider/2/2/2-cefae600f856a3df070211263445a86f.jpg",
    "https://www.metacritic.com/a/img/resize/1be0691ecfa06c020218366d5438b01c52582001/catalog/provider/2/2/2-a50aaaf3064952bae8cd68c5c6c46011.jpg",
    "https://www.metacritic.com/a/img/resize/0e9adce9b10919dac0b4476006cc3c1a7ded8b05/catalog/provider/2/2/2-be84cf8692bba160edb758201ae36914.jpg",
    "https://www.metacritic.com/a/img/resize/ecc516fb9c43d1ad51c42ab7e7f1e6d8eba012d9/catalog/provider/2/2/2-769ad4fb6decf83f4f9523ba57dfc94b.jpg",
    "",
    "https://www.metacritic.com/a/img/resize/823038428a1a17c5cc54343deec533deadfe79e1/catalog/provider/2/2/2-d72397570fc53971fe7ea3915fc49bc4.jpg",
    "https://www.metacritic.com/a/img/resize/338d00076b278e255ac01a8e96b70b0b0f0bb4e6/catalog/provider/2/2/2-17b36172849c4be2919afe63b90877f7.jpg",
    "https://www.metacritic.com/a/img/resize/d27b4ead86223cdf6b901fb3e037ea990b028ad7/catalog/provider/2/2/2-da98a382b292acca454f8fc84bc87df9.jpg",
    "https://www.metacritic.com/a/img/resize/5bf13eaa3eec8504e9acfa7f9da55e6f07fc70bc/catalog/provider/2/2/2-e0450e9cd675d0e5d6655838e9ba7345.jpg",
    "https://www.metacritic.com/a/img/resize/38162f07d45d92c1d765fd71b9ae37c089c07cf9/catalog/provider/2/2/2-473476a0bec6aefca6bcbecb263adee2.jpg"
]

In [167]:
# add the 2 new columns to the initial data frame

tv_shows_df['show_URL'] = show_url_list
tv_shows_df['image_URL'] = image_urls

tv_shows_df

|   | tv_show_name | initial_release_date | metascore_rating | film_rating | description | show_URL | image_URL |
|---|---|---|---|---|---|---|---|
| 1 | Planet Earth: Blue Planet II | Jan 20, 2018 | 97 | RatedTV-G | Airing simultaneously on AMC, BBC America, IFC... | https://www.metacritic.com/tv/planet-earth-blu... | https://www.metacritic.com/a/img/resize/35590a... |
| 2 | The Office (UK) | Jan 23, 2003 | 97 | RatedTV-MA | "Trust, encouragement, reward, loyalty... sati... | https://www.metacritic.com/tv/the-office-uk/ | https://www.metacritic.com/a/img/resize/28a625... |
| 3 | America to Me | Aug 26, 2018 | 96 | RatedTV-14 | The 10-part documentary series from Steve Jame... | https://www.metacritic.com/tv/america-to-me/ | https://www.metacritic.com/a/img/resize/9d4e8a... |
| 4 | OJ: Made in America | May 20, 2016 | 96 | RatedTV-MA | The seven-and-a-half-hour documentary chronicl... | https://www.metacritic.com/tv/oj-made-in-america/ | https://www.metacritic.com/a/img/resize/a5e200... |
| 5 | Bo Burnham: Inside | May 30, 2021 | 96 | RatedTV-MA | The musical comedy special was filmed by the c... | https://www.metacritic.com/tv/bo-burnham-inside/ | https://www.metacritic.com/a/img/resize/214d3a... |
| 6 | The US and the Holocaust | Sep 18, 2022 | 96 | RatedTV-14 | Narrated by Peter Coyote, the three-part docum... | https://www.metacritic.com/tv/the-us-and-the-h... | https://www.metacritic.com/a/img/resize/02efe7... |
| 7 | Planet Earth II | Feb 18, 2017 | 96 | RatedTV-G | Narrated by David Attenborough, the sequel to ... | https://www.metacritic.com/tv/planet-earth-ii/ | https://www.metacritic.com/a/img/resize/a7e19a... |
| 8 | The Staircase | May 31, 2004 | 95 | RatedTV-MA | An 8-part documentary series about the celebra... | https://www.metacritic.com/tv/the-staircase/ | https://www.metacritic.com/a/img/resize/93facf... |
| 9 | The Larry Sanders Show | Aug 15, 1992 | 95 | RatedTV-MA | Comic Garry Shandling draws upon his own talk ... | https://www.metacritic.com/tv/the-larry-sander... | https://www.metacritic.com/a/img/resize/0163d4... |
| 10 | Homicide: Life on the Street | Jan 31, 1993 | 95 | RatedTV-14 | This series was the most reality-based police ... | https://www.metacritic.com/tv/homicide-life-on... | https://www.metacritic.com/a/img/resize/efe126... |


## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [181]:
# Note: the request below uses the weatherapi.com current-weather endpoint
city = input('Enter the city: ')
url = f'https://api.weatherapi.com/v1/current.json?key=5a68dbd3fe6242678ac130253242505&q={city}&aqi=no'


Enter the city: Lisbon


In [182]:
# Function to get weather report
def get_weather(city):
    # API key and endpoint
    api_key = '5a68dbd3fe6242678ac130253242505'
    url = f'https://api.weatherapi.com/v1/current.json?key={api_key}&q={city}&aqi=no'

    # Make a request to the weather API
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        # Extract relevant data
        temperature = data['current']['temp_c']
        wind_speed = data['current']['wind_kph']
        weather_description = data['current']['condition']['text']
        weather_icon = data['current']['condition']['icon']
        
        # Print the weather report
        print(f"Weather in {city}:")
        print(f"Temperature: {temperature}°C")
        print(f"Wind Speed: {wind_speed} kph")
        print(f"Description: {weather_description}")
        print(f"Icon URL: {weather_icon}")
    else:
        print("Error: Unable to fetch weather data. Please check the city name and try again.")
    

In [183]:
get_weather(city)

Weather in Lisbon:
Temperature: 23.0°C
Wind Speed: 16.9 kph
Description: Sunny
Icon URL: //cdn.weatherapi.com/weather/64x64/day/113.png
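
The icon URL in the report above is protocol-relative (it starts with `//`), so it cannot be opened as-is. One way to handle that, and to make the formatting testable without calling the API, is to separate formatting from fetching. In the sketch below, `format_weather` is a hypothetical helper and the sample dictionary is rebuilt by hand from the fields and values shown above, not from a real API response.

```python
def format_weather(city, data):
    """Build the report lines from a parsed current-weather JSON dictionary."""
    current = data['current']
    icon = current['condition']['icon']
    # Turn a protocol-relative icon link into a full https URL.
    if icon.startswith('//'):
        icon = 'https:' + icon
    return [
        f"Weather in {city}:",
        f"Temperature: {current['temp_c']}°C",
        f"Wind Speed: {current['wind_kph']} kph",
        f"Description: {current['condition']['text']}",
        f"Icon URL: {icon}",
    ]


# Hand-built sample that mirrors the values printed for Lisbon above.
sample = {
    'current': {
        'temp_c': 23.0,
        'wind_kph': 16.9,
        'condition': {
            'text': 'Sunny',
            'icon': '//cdn.weatherapi.com/weather/64x64/day/113.png',
        },
    }
}
report = format_weather('Lisbon', sample)
print('\n'.join(report))
```

Returning the lines instead of printing them inside the function makes the formatting easy to check and reuse.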


#### Find the book name, price and stock availability from books to scrape website as a pandas dataframe.

In [189]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [222]:
# your code here

# Send a GET request to the URL
response = requests.get(url)

# Initialise the list up front so the DataFrame step below still works if the request fails
books_data = []

# Check the status code
if response.status_code == 200:
    # Parse the content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the elements containing the book data
    books_section = soup.find_all('li', class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
    for book in books_section:
        # Extract book title
        title = book.h3.a['title']
        # Extract book price
        price = book.find('p', class_='price_color').text
        # Extract stock availability
        stock_availability = book.find('p', class_='instock availability').get_text(strip=True)

        # Append the information to books_data
        books_data.append([title, price, stock_availability])

else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")
    

books_df = pd.DataFrame(books_data, columns=['title','price','availability'])
books_df.index += 1

books_df

Unnamed: 0,title,price,availability
1,A Light in the Attic,£51.77,In stock
2,Tipping the Velvet,£53.74,In stock
3,Soumission,£50.10,In stock
4,Sharp Objects,£47.82,In stock
5,Sapiens: A Brief History of Humankind,£54.23,In stock
6,The Requiem Red,£22.65,In stock
7,The Dirty Little Secrets of Getting Your Dream...,£33.34,In stock
8,The Coming Woman: A Novel Based on the Life of...,£17.93,In stock
9,The Boys in the Boat: Nine Americans and Their...,£22.60,In stock
10,The Black Maria,£52.15,In stock


#### Display the first 100 books listed on the homepage. Once again, collect each book's name, price and stock availability.

***Hint:*** Each page displays 20 books, so you can reach the following pages by looping over the desired number of pages and appending each page number to the end of the url.
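
As a minimal sketch of the hint above, the page URLs can be derived from the number of books wanted and the page size. The helper name `build_page_urls` and its parameters are hypothetical and not part of the exercise; `math.ceil` is used so that a count that is not a multiple of 20 still rounds up to a whole page.

```python
import math

# Same URL prefix as used in this exercise
BASE_URL = 'https://books.toscrape.com/catalogue/page-'

def build_page_urls(total_books, books_per_page=20, base_url=BASE_URL):
    """Return the page URLs needed to cover total_books items (hypothetical helper)."""
    number_of_pages = math.ceil(total_books / books_per_page)
    return [f"{base_url}{n}.html" for n in range(1, number_of_pages + 1)]

urls = build_page_urls(100)
print(len(urls))   # number of pages to request
print(urls[0])     # first page URL
print(urls[-1])    # last page URL
```

For 100 books at 20 per page this yields five URLs, from `page-1.html` to `page-5.html`.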

In [226]:
# This is the url you will scrape in this exercise
url = 'https://books.toscrape.com/catalogue/page-'
# This is how you will loop through each page:
number_of_pages = 100 // 20  # 100 books at 20 books per page
# List to store the URLs of each page
each_page_urls = [url + str(n) + ".html" for n in range(1, number_of_pages + 1)]

In [227]:
# your code here
# List to store book data
books_data = []

# Loop through each page URL and scrape data
for page_url in each_page_urls:
    # Send a GET request to the page URL
    response = requests.get(page_url)

    # Check the status code
    if response.status_code == 200:

        # Parse the content with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the elements containing the book data
        books_section = soup.find_all('li', class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")

        # Loop through each book in the section
        for book in books_section:
            # Extract book title
            title = book.h3.a['title']
            # Extract book price
            price = book.find('p', class_='price_color').text
            # Extract stock availability
            stock_availability = book.find('p', class_='instock availability').get_text(strip=True)

            # Append the information to books_data
            books_data.append([title, price, stock_availability])

    else:
        print(f"Failed to retrieve {page_url}. Status code: {response.status_code}")
        
# Create a DataFrame from the collected data
books_df = pd.DataFrame(books_data, columns=['title', 'price', 'availability'])

# Display the DataFrame
books_df.index += 1
books_df

Unnamed: 0,title,price,availability
1,A Light in the Attic,£51.77,In stock
2,Tipping the Velvet,£53.74,In stock
3,Soumission,£50.10,In stock
4,Sharp Objects,£47.82,In stock
5,Sapiens: A Brief History of Humankind,£54.23,In stock
...,...,...,...
96,Lumberjanes Vol. 3: A Terrible Plan (Lumberjan...,£19.92,In stock
97,"Layered: Baking, Building, and Styling Spectac...",£40.11,In stock
98,Judo: Seven Steps to Black Belt (an Introducto...,£53.90,In stock
99,Join,£35.67,In stock
