<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>



# Lab 8.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite the code as needed.

## Inspecting a Wikipedia Page
Let’s take one page from **Wikipedia** as an example.

Open the web page on [List of years in film](https://en.wikipedia.org/wiki/List_of_years_in_film) with the browser and inspect it.

It has a number of movies listed by year. We shall scrape these (focus on the years 1900 onwards) and load our results into a dataframe having the following structure:

|Year   |Movie   |URL   |
|---|---|---|
|...   |...   |...   | 

In [1]:
## Import Libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
pd.set_option('display.max_colwidth', None) #enables columns to be displayed entirely

### Define the content to retrieve (webpage's URL)

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_years_in_film"

In [3]:
# Retrieve the page content
r = requests.get(url)
if r.status_code == 200:
    page = r.content
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status Code: %d, Page Size: %d' % (r.status_code, len(page)))
else:
    print('Some problem occurred. Request Status Code: %d' % r.status_code)

Type of the variable 'page': bytes
Page Retrieved. Request Status Code: 200, Page Size: 239281


### Convert the stream of bytes into a BeautifulSoup representation

In [4]:
# Convert to BeautifulSoup object
soup = BeautifulSoup(page, 'html.parser')

### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

### Check the HTML's Title

In [28]:
# Check the HTML's Title
print("Page title:", soup.title.string)

Page title: List of years in film - Wikipedia


### `<li>` tags
- This page uses the tag `li` to introduce each year in the list of films

Example:
        `<li><b><a href="/wiki/1971_in_film" title="1971 in film">1971</a></b>`

Use the find_all method to extract all `li` tags not containing any class or id attributes.

In [19]:
# Extract all li tags without class or id attributes
list_of_li_tags = soup.find_all('li', class_=False, id=False)
print(f"Total number of <li> tags without class or id: {len(list_of_li_tags)}")

Total number of <li> tags without class or id: 343


In [7]:
len(list_of_li_tags)

343

Identify those tags which correspond to the years 1900 to 2023.

In [20]:
relevant_tags = list_of_li_tags[-124:]  # Assuming 2023 is the last year
print(f"Number of relevant tags (1900-2023): {len(relevant_tags)}")

Number of relevant tags (1900-2023): 124


Let's focus on parsing one tag, then extend that to all tags afterwards.

In [22]:
li_tag = relevant_tags[-1]
print("Content of the last <li> tag:")
print(li_tag)

Content of the last <li> tag:
<li><a href="/wiki/Category:Short_description_is_different_from_Wikidata" title="Category:Short description is different from Wikidata">Short description is different from Wikidata</a></li>


To identify the year let us look for the pattern "x in film" in the `title` attribute of the link tag:


In [23]:
year_tag = li_tag.find('a', title = lambda x: x)
year_tag

<a href="/wiki/Category:Short_description_is_different_from_Wikidata" title="Category:Short description is different from Wikidata">Short description is different from Wikidata</a>

From this we extract the year:

In [25]:
# Extract the year
year_tag = li_tag.find('a', title=lambda x: x and 'in film' in x)
if year_tag:
    year = year_tag.text.strip()
print(f"\nExtracted year: {year}")


Extracted year: 2024


Next we extract the movie titles and urls:

In [27]:
movie_tags = li_tag.find_all('i')
print(f"\nNumber of movies found: {len(movie_tags)}")


Number of movies found: 0


Extract the movie name and url from the first of these movie tags:

In [13]:
first_movie_name = movie_tags[0].???
first_movie_name

SyntaxError: invalid syntax (2920884761.py, line 1)

The url can be extracted as follows:

In [None]:
first_movie_url_tag = movie_tags[0].find('a')['href']
'http://en.wikipedia.org' + first_movie_url_tag

## Parsing all elements

Complete the code below to extract all the years, movies and movie_urls into lists:

In [14]:
years = []
movies = []
movie_urls = []

for li in relevant_tags:
    year_tag = li.find('a', title=lambda x: x and 'in film' in x)
    if year_tag:
        year = year_tag.text.strip()
        movie_tags = li.find_all('i')
        
        for movie_tag in movie_tags:
            movie_title = movie_tag.text.strip()
            movie_url_tag = movie_tag.find('a')
            
            if movie_url_tag and 'href' in movie_url_tag.attrs:
                movie_url = 'http://en.wikipedia.org' + movie_url_tag['href']
               
                years.append(year)
                movies.append(movie_title)
                movie_urls.append(movie_url)

Create a dataframe containing this information:

In [15]:
df = pd.DataFrame({'year': years, 'movie': movies, 'movie_url': movie_urls})

In [16]:
df

Unnamed: 0,year,movie,movie_url
0,1943,The Song of Bernadette,http://en.wikipedia.org/wiki/The_Song_of_Bernadette_(film)
1,1943,Heaven Can Wait,http://en.wikipedia.org/wiki/Heaven_Can_Wait_(1943_film)
2,1943,Phantom of the Opera,http://en.wikipedia.org/wiki/Phantom_of_the_Opera_(1943_film)
3,1943,The Life and Death of Colonel Blimp,http://en.wikipedia.org/wiki/The_Life_and_Death_of_Colonel_Blimp
4,1943,For Whom the Bell Tolls,http://en.wikipedia.org/wiki/For_Whom_the_Bell_Tolls_(film)
...,...,...,...
862,2024,Civil War,http://en.wikipedia.org/wiki/Civil_War_(film)
863,2024,Deadpool & Wolverine,http://en.wikipedia.org/wiki/Deadpool_%26_Wolverine
864,2024,Inside Out 2,http://en.wikipedia.org/wiki/Inside_Out_2
865,2024,Dune: Part Two,http://en.wikipedia.org/wiki/Dune:_Part_Two


In [18]:
# Display the first few rows of the dataframe
print(df.head())

   year                                movie  \
0  1943               The Song of Bernadette   
1  1943                      Heaven Can Wait   
2  1943                 Phantom of the Opera   
3  1943  The Life and Death of Colonel Blimp   
4  1943              For Whom the Bell Tolls   

                                                          movie_url  
0        http://en.wikipedia.org/wiki/The_Song_of_Bernadette_(film)  
1          http://en.wikipedia.org/wiki/Heaven_Can_Wait_(1943_film)  
2     http://en.wikipedia.org/wiki/Phantom_of_the_Opera_(1943_film)  
3  http://en.wikipedia.org/wiki/The_Life_and_Death_of_Colonel_Blimp  
4       http://en.wikipedia.org/wiki/For_Whom_the_Bell_Tolls_(film)  


**Question**: Which year had the most movies listed?


In [17]:
#ANSWER:
# Find the year with the most movies listed
year_with_most_movies = df['year'].value_counts().index[0]
number_of_movies = df['year'].value_counts().iloc[0]

print(f"The year with the most movies listed is {year_with_most_movies} with {number_of_movies} movies.")

The year with the most movies listed is 2002 with 22 movies.


Through webscraping from Wikipedia we now have a dataframe containing a list of prominent movies by year together with their Wikipedia links.



---



---



> > > > > > > > > © 2024 Institute of Data


---



---



