<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>



# Lab 8.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite the code as needed.

## Inspecting a Wikipedia Page
Let’s take one page from **Wikipedia** as an example.

Open the web page on [List of years in film](https://en.wikipedia.org/wiki/List_of_years_in_film) with the browser and inspect it.

It has a number of movies listed by year. We shall scrape these (focus on the years 1900 onwards) and load our results into a dataframe having the following structure:

|Year   |Movie   |URL   |
|---|---|---|
|...   |...   |...   | 

In [30]:
## Import Libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
pd.set_option('display.max_colwidth', None) #enables columns to be displayed entirely

### Define the content to retrieve (webpage's URL)

In [31]:
url = "https://en.wikipedia.org/wiki/List_of_years_in_film"

In [32]:
r = requests.get(url)
if r.status_code == 200:
    page = r.content
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status Code: %d, Page Size: %d' % (r.status_code, len(page)))
else:
    print('Some problem occurred. Request Status Code: %d' % r.status_code)

Type of the variable 'page': bytes
Page Retrieved. Request Status Code: 200, Page Size: 239194


### Convert the stream of bytes into a BeautifulSoup representation

In [33]:
#ANSWER
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [34]:
#ANSWER
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-not-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of years in film - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limite

### Check the HTML's Title

In [35]:
#ANSWER
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>List of years in film - Wikipedia</title>:
Title text:List of years in film - Wikipedia:


### `<li>` tags
- This page uses the tag `li` to introduce each year in the list of films

Example:
        `<li><b><a href="/wiki/1971_in_film" title="1971 in film">1971</a></b>`

Use the find_all method to extract all `li` tags not containing any class or id attributes.

In [36]:
list_of_li_tags = soup.find_all('li', attrs={'class': None, 'id': None})

In [37]:
len(list_of_li_tags)

343

Identify those tags which correspond to the years 1900 to 2023.

In [38]:
relevant_tags = []

for li in list_of_li_tags:
    b_tag = li.find('b')
    if b_tag:
        year_text = b_tag.find('a').get_text()
      
        if year_text.isdigit():
            year = int(year_text)
            if 1900 <= year <= 2023:
                relevant_tags.append(li)  


print(relevant_tags)

[<li><b><a href="/wiki/1900_in_film" title="1900 in film">1900</a></b> – <i><a href="/wiki/Sherlock_Holmes_Baffled" title="Sherlock Holmes Baffled">Sherlock Holmes Baffled</a></i>, <i><a href="/wiki/Joan_of_Arc_(1900_film)" title="Joan of Arc (1900 film)">Joan of Arc</a></i></li>, <li><b><a href="/wiki/1901_in_film" title="1901 in film">1901</a></b> – <i><a href="/wiki/Blue_Beard_(1901_film)" title="Blue Beard (1901 film)">Blue Beard</a></i>, <i><a href="/wiki/Star_Theatre_(film)" title="Star Theatre (film)">Star Theatre</a></i>, <i><a href="/wiki/Stop_Thief!" title="Stop Thief!">Stop Thief!</a></i>, <i><a href="/wiki/Scrooge,_or,_Marley%27s_Ghost" title="Scrooge, or, Marley's Ghost">Scrooge, or, Marley's Ghost</a></i></li>, <li><b><a href="/wiki/1902_in_film" title="1902 in film">1902</a></b> – <i><a href="/wiki/A_Trip_to_the_Moon" title="A Trip to the Moon">A Trip to the Moon</a></i>, <i><a href="/wiki/The_Coronation_of_Edward_VII" title="The Coronation of Edward VII">The Coronation 

Let's focus on parsing one tag, then extend that to all tags afterwards.

In [39]:
li_tag = relevant_tags[-1]
li_tag

<li><b><a href="/wiki/2023_in_film" title="2023 in film">2023</a></b> – <i><a href="/wiki/Barbie_(film)" title="Barbie (film)">Barbie</a></i>, <i><a href="/wiki/Oppenheimer_(film)" title="Oppenheimer (film)">Oppenheimer</a></i>, <i><a href="/wiki/Poor_Things_(film)" title="Poor Things (film)">Poor Things</a></i>, <i><a href="/wiki/The_Zone_of_Interest_(film)" title="The Zone of Interest (film)">The Zone of Interest</a></i>, <i><a href="/wiki/M3GAN" title="M3GAN">M3GAN</a></i>, <i><a href="/wiki/The_Super_Mario_Bros._Movie" title="The Super Mario Bros. Movie">The Super Mario Bros. Movie</a></i>, <i><a href="/wiki/Maestro_(2023_film)" title="Maestro (2023 film)">Maestro</a></i>, <i><a href="/wiki/The_Boy_and_the_Heron" title="The Boy and the Heron">The Boy and the Heron</a></i>, <i><a href="/wiki/Once_Upon_a_Studio" title="Once Upon a Studio">Once Upon a Studio</a></i>, <i><a href="/wiki/Are_You_There_God%3F_It%27s_Me,_Margaret._(film)" title="Are You There God? It's Me, Margaret. (film)

To identify the year let us look for the pattern "x in film" in the `title` attribute of the link tag:


In [40]:
year_tag = li_tag.find('a', title = lambda x: x)
year_tag

<a href="/wiki/2023_in_film" title="2023 in film">2023</a>

From this we extract the year:

In [41]:
year_tag.text.strip()

'2023'

Next we extract the movie titles and urls:

In [13]:
movie_tags = li_tag.find_all('i')
movie_tags

[<i><a href="/wiki/Barbie_(film)" title="Barbie (film)">Barbie</a></i>,
 <i><a href="/wiki/Oppenheimer_(film)" title="Oppenheimer (film)">Oppenheimer</a></i>,
 <i><a href="/wiki/Poor_Things_(film)" title="Poor Things (film)">Poor Things</a></i>,
 <i><a href="/wiki/The_Zone_of_Interest_(film)" title="The Zone of Interest (film)">The Zone of Interest</a></i>,
 <i><a href="/wiki/M3GAN" title="M3GAN">M3GAN</a></i>,
 <i><a href="/wiki/The_Super_Mario_Bros._Movie" title="The Super Mario Bros. Movie">The Super Mario Bros. Movie</a></i>,
 <i><a href="/wiki/Maestro_(2023_film)" title="Maestro (2023 film)">Maestro</a></i>,
 <i><a href="/wiki/The_Boy_and_the_Heron" title="The Boy and the Heron">The Boy and the Heron</a></i>,
 <i><a href="/wiki/Once_Upon_a_Studio" title="Once Upon a Studio">Once Upon a Studio</a></i>,
 <i><a href="/wiki/Are_You_There_God%3F_It%27s_Me,_Margaret._(film)" title="Are You There God? It's Me, Margaret. (film)">Are You There God? It's Me, Margaret</a></i>]

Extract the movie name and url from the first of these movie tags:

In [14]:
text = movie_tags[0].text.strip()
text

'Barbie'

In [15]:
first_movie_name = movie_tags[0].text.strip()
first_movie_name

'Barbie'

The url can be extracted as follows:

In [16]:
first_movie_url_tag = movie_tags[0].find('a')['href']
'http://en.wikipedia.org' + first_movie_url_tag

'http://en.wikipedia.org/wiki/Barbie_(film)'

## Parsing all elements

Complete the code below to extract all the years, movies and movie_urls into lists:

In [17]:
years = []
movies = []
movie_urls = []

# Iterate over the <li> tags and extract the year, movie name and url
for li in relevant_tags:
    year_tag = li.find('a', title=lambda x: x and 'in film' in x)
    if year_tag:
        year = year_tag.text.strip()
        movie_tags = li_tag.find_all('i')
        
        for movie_tag in movie_tags:
            movie_title = movie_tag.text.strip()
            movie_url_tag = movie_tag.get('href', None)
            
            # Test if the movie has a URL
            if movie_url_tag and 'href' in movie_url_tag.attrs:
                movie_url = 'http://en.wikipedia.org' + movie_url
               
            # Append each year, movie title and corresponding url to the lists
                years.append(year)
                movies.append(movie_title)
                movie_urls.append(movie_url)

Create a dataframe containing this information:

In [18]:
df = pd.DataFrame({'year': years, 'movie': movies, 'movie_url': movie_urls})

In [29]:
df

Unnamed: 0,Year,Movie


**Question**: Which year had the most movies listed?


In [54]:
movies_per_year = movies_per_year[movies_per_year > 0]
movies_per_year

Series([], Name: count, dtype: int64)

Through webscraping from Wikipedia we now have a dataframe containing a list of prominent movies by year together with their Wikipedia links.



---



---



> > > > > > > > > © 2024 Institute of Data


---



---



