# `DATA COLLECTION`
# **TOPIC: FILMS ANALYSIS**
`Group ID`: 17

`Group Members`:
- 22127404_Tạ Minh Thư
- 22127359_Chu Thúy Quỳnh
- 22127302_Nguyễn Đăng Nhân

***

## **OBJECTIVES**

In this phase, a data collection pipeline is developed to gather relevant information about the films with the highest lifetime gross revenue on Box Office Mojo, using the Requests library to fetch the pages and Beautiful Soup to parse them.

The scraped data will be organized into a well-defined table format, including nine attributes (rank, title, foreign %, domestic %, year, genre, director, writer, cast) and 1,000 records to ensure sufficient coverage and data diversity for preprocessing and exploration in the next phase.

## **IMPLEMENTATION WITH EXPLANATION**

### **SETUP AND IMPORTS**

There are three libraries imported in this phase:
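- **requests**: sends HTTP requests to fetch the web pages.
- **BeautifulSoup** (from `bs4`): parses the HTML content of each page so that the required elements can be extracted.
- **pandas**: organizes the scraped data into a DataFrame and exports it to a file.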

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### **WEB SCRAPING CONFIGURATION**

In this step, the base URL is defined with a placeholder `{}`, which allows different offset values to be inserted dynamically. This makes it possible to scrape multiple pages of the website by adjusting the offset for each request.

Before sending requests to the website, a User-Agent header is set up so that the requests resemble those of a regular browser, which reduces the chance of the server blocking them as automated traffic.

In [2]:
base_url = "https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset={}"
headers = {
    'User-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36 OPR/38.0.2220.41'
    }
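
As a quick check of the pagination scheme, the sketch below lists the URLs that the `{}` placeholder produces. The range of 0 to 1000 in steps of 200 is the one used by the extraction loop in the next section; `page_urls` is a name introduced here for illustration only.

```python
base_url = "https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset={}"

# Offsets 0, 200, 400, 600, 800: one per page of 200 films (illustrative helper list)
page_urls = [base_url.format(offset) for offset in range(0, 1000, 200)]

for page_url in page_urls:
    print(page_url)
```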

### **DATA EXTRACTION**

#### **OVERALL INFORMATION**

For the overall information of films, six empty lists are initialized to store the following attributes:
- `ranks`: The film's rank in the top lifetime grosses.
- `titles`: The film's name.
- `links`: The link to the film's description page.
- `foreign_grosses`: The percentage of the foreign grosses in the film's worldwide grosses.
- `domestic_grosses`: The percentage of the domestic grosses in the film's worldwide grosses.
- `years`: The year that the film was first released.

The domain link of Box Office Mojo is defined for constructing the full URL to each film's description page. 

A loop then iterates through offset values from 0 to 800 in steps of 200 (`range(0, 1000, 200)`). This range is chosen because each page on the website displays 200 films and there are 1,000 films in total. In each iteration:
- The offset value is inserted into `base_url` to create the full link.
- `requests.get(url, headers=headers)` sends a GET request to the constructed `url` with the custom headers to simulate a real browser.
- `response.raise_for_status()` checks whether the request succeeded. If the response has a 4xx or 5xx status code, an `HTTPError` is raised so that the failure surfaces immediately.
- `BeautifulSoup` parses the HTML content for easier extraction.
- The data is extracted step by step, using `soup.find_all()`.

In [3]:
ranks, titles, links, foreign_grosses, domestic_grosses, years  = [], [], [], [], [], []
domain_url = "https://www.boxofficemojo.com"

for offset in range(0, 1000, 200):
    url = base_url.format(offset)

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')

    for rank in soup.find_all('td', class_ = 'a-text-right mojo-header-column mojo-truncate mojo-field-type-rank'):
        ranks.append(rank.text)

    for element in soup.find_all('td', class_ = 'a-text-left mojo-field-type-title'):
        titles.append(element.text)
        links.append(domain_url + element.a['href'])

    # Percent cells alternate within each row: domestic % first, then foreign %
    percent_cells = soup.find_all('td', class_ = 'a-text-right mojo-field-type-percent')

    for gross in percent_cells[1::2]:
        foreign_grosses.append(gross.text)

    for gross in percent_cells[::2]:
        domestic_grosses.append(gross.text)

    for year in soup.find_all('td', class_ = 'a-text-left mojo-field-type-year'):
        years.append(year.text)
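
The slicing used for the two percentage columns can be illustrated without any network access. This is a minimal sketch that assumes, as the loop above does, that the percent cells of each row appear in the order domestic then foreign. The values are those of the first two rows of the final table, and `percent_cells` is a hypothetical stand-in for the text of the matched cells.

```python
# Text of the percent cells in page order: domestic %, foreign %, domestic %, foreign %, ...
percent_cells = ["26.9%", "73.1%", "30.7%", "69.3%"]

domestic = percent_cells[::2]   # even positions
foreign = percent_cells[1::2]   # odd positions

print(domestic)
print(foreign)
```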

#### **DETAILED INFORMATION**

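For the detailed information, four lists are initialized to store the following attributes:
- `genres`: The genre(s) associated with each film.
- `directors`: The director(s) of each film.
- `writers`: The writer(s) credited for each film.
- `casts`: The main cast members of each film.

An auxiliary list, `crew_urls`, stores the link to each film's cast and crew page. Genres are scraped from each film's description page, while directors, writers, and cast members are scraped from the corresponding cast and crew page.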

In [4]:
genres, crew_urls = [], []

for link in links:
    response = requests.get(link, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')

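    # Genres are read from the second-to-last summary row, skipping its first span
    summary_rows = soup.find_all('div', class_ = 'a-section a-spacing-none')
    genres.append(", ".join(genre.text for genre in summary_rows[-2].find_all('span')[1:]))

    # The first navigation tab link is used as the film's cast and crew page
    tabs = soup.find_all('a', class_ = 'a-size-base a-link-normal mojo-navigation-tab')
    crew_urls.append(domain_url + tabs[0]['href'])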

In [5]:
directors, writers, casts = [], [], []

for url in crew_urls:
    director_container, writer_container, cast_container = [], [], []
    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')

    # Both tables share the same class: index 0 is read as the crew table, index 1 as the cast table
    tables = soup.find_all('table', class_ = 'a-bordered a-horizontal-stripes a-spacing-base a-size-base-plus')

    # Skip the header row; the row text is split into name (first part) and role (second part)
    for member in tables[0].find_all('tr')[1:]:
        name, role = member.text.split('\n\n')[:2]

        if role == 'Director':
            director_container.append(name)
        elif role == 'Writer':
            writer_container.append(name)

    for cast in tables[1].find_all('tr')[1:]:
        cast_container.append(cast.text.split('\n\n')[0])

    directors.append(", ".join(director_container))
    writers.append(", ".join(writer_container))
    casts.append(", ".join(cast_container))
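
The role-based split in the crew loop can be tried on plain strings. The sketch below assumes that each row's text has the form name, blank line, role, which is what the `split('\n\n')` indexing above relies on. The `rows` list is illustrative rather than scraped output, and the last entry is a placeholder added to show that other roles are ignored.

```python
# Hypothetical row texts in the assumed "name\n\nrole" layout
rows = [
    "Anthony Russo\n\nDirector",
    "Joe Russo\n\nDirector",
    "Christopher Markus\n\nWriter",
    "Stephen McFeely\n\nWriter",
    "Example Person\n\nProducer",  # placeholder row with an unhandled role
]

director_container, writer_container = [], []
for text in rows:
    name, role = text.split("\n\n")[:2]
    if role == "Director":
        director_container.append(name)
    elif role == "Writer":
        writer_container.append(name)

print(", ".join(director_container))
print(", ".join(writer_container))
```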


In [6]:
films = pd.DataFrame({
    'Rank': ranks,
    'Title': titles,
    'Foreign %': foreign_grosses,
    'Domestic %': domestic_grosses,
    'Year': years,
    'Genre': genres,
    'Director': directors,
    'Writer': writers,
    'Cast': casts
})

films

| | Rank | Title | Foreign % | Domestic % | Year | Genre | Director | Writer | Cast |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Avatar | 73.1% | 26.9% | 2009 | Action\n \n Adventure\n \n ... | James Cameron | James Cameron | Sam Worthington, Zoe Saldana, Sigourney Weaver... |
| 1 | 2 | Avengers: Endgame | 69.3% | 30.7% | 2019 | Action\n \n Adventure\n \n ... | Anthony Russo, Joe Russo | Christopher Markus, Stephen McFeely, Stan Lee,... | Robert Downey Jr., Chris Evans, Mark Ruffalo, ... |
| 2 | 3 | Avatar: The Way of Water | 70.5% | 29.5% | 2022 | Action\n \n Adventure\n \n ... | James Cameron | James Cameron, Rick Jaffa, Amanda Silver, Jame... | Sam Worthington, Zoe Saldana, Sigourney Weaver... |
| 3 | 4 | Titanic | 70.2% | 29.8% | 1997 | Drama\n \n Romance | James Cameron | James Cameron | Leonardo DiCaprio, Kate Winslet, Billy Zane, K... |
| 4 | 5 | Star Wars: Episode VII - The Force Awakens | 54.8% | 45.2% | 2015 | Action\n \n Adventure\n \n ... | J.J. Abrams | Lawrence Kasdan, J.J. Abrams, Michael Arndt, G... | Daisy Ridley, John Boyega, Oscar Isaac, Domhna... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | The Final Destination | 64.3% | 35.7% | 2009 | Horror\n \n Thriller | David R. Ellis | Eric Bress, Jeffrey Reddick | Nick Zano, Krista Allen, Andrew Fiscella, Bobb... |
| 996 | 997 | Atlantis: The Lost Empire | 54.8% | 45.2% | 2001 | Action\n \n Adventure\n \n ... | Gary Trousdale, Kirk Wise | Tab Murphy, Kirk Wise, Gary Trousdale, Joss Wh... | Michael J. Fox, Jim Varney, Corey Burton, Clau... |
| 997 | 998 | Inside Man | 52.4% | 47.6% | 2006 | Crime\n \n Drama\n \n Myst... | Spike Lee | Russell Gewirtz | Denzel Washington, Clive Owen, Jodie Foster, C... |
| 998 | 999 | The Waterboy | 13.2% | 86.8% | 1998 | Comedy\n \n Sport | Frank Coraci | Tim Herlihy, Adam Sandler | Adam Sandler, Kathy Bates, Henry Winkler, Fair... |
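
The Genre column in the output above still contains the line breaks and padding from the page markup. One possible cleanup, sketched here as an assumption about how the next phase might handle it rather than as part of this pipeline, splits each value on line breaks and drops the empty pieces. `clean_genre` is a hypothetical helper.

```python
def clean_genre(raw: str) -> str:
    """Collapse newline-separated genre text into a comma-separated string."""
    parts = [part.strip() for part in raw.splitlines()]
    return ", ".join(part for part in parts if part)


print(clean_genre("Drama\n \n Romance"))  # Drama, Romance
```

Applied to the whole column, this could be written as `films['Genre'] = films['Genre'].apply(clean_genre)`.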


In [7]:
films.to_csv("films_data.csv", sep='\t', encoding='utf-8', index=False, header=True)
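
Note that the file is written with a tab separator despite its `.csv` extension, so it has to be read back with the same delimiter, for example `pd.read_csv("films_data.csv", sep='\t')`. The round trip below is a small sketch that uses only the standard library's `csv` module and an in-memory buffer instead of the real file; the single row shows a subset of the columns and is illustrative.

```python
import csv
import io

# Write one tab-separated row to an in-memory buffer (stand-in for films_data.csv)
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["Rank", "Title", "Foreign %", "Domestic %", "Year"])
writer.writerow(["4", "Titanic", "70.2%", "29.8%", "1997"])

# Read it back with the same delimiter
buffer.seek(0)
rows = list(csv.DictReader(buffer, delimiter="\t"))
print(rows[0]["Title"], rows[0]["Year"])
```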