# Introduction To Data Science - Final Project

## Group members:

| Name              | ID       |
|-------------------|----------|
| Pham Dang Son Ha |          |
| Tran Dai Nien     | 21127664 |
| Nguyen Cao Khoi   |          |
| Nguyen Phan Minh Triet  | 21126007  |

## Table of Contents

1. [Data Collection](#data-collection)

2. [Data Preprocessing and Exploration](#data-preprocessing-and-exploration)

3. [Data Modeling](#data-modeling)

4. [References](#references)

## Data Collection

### 1. Set Up the Environment

Install and import the required Python libraries: `requests`, `BeautifulSoup` (from `bs4`), `pandas`, `numpy`, and the standard-library `time` module.

In [1]:
# Install the required packages (the bs4 module is provided by beautifulsoup4)
!pip install beautifulsoup4 requests pandas numpy
import time
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup



### 2. Collect Data from a Website by Parsing HTML

### List of collected information

Information related to the movie, including:

- `names`: Movie titles.
- `years`: Release years of the movies.
- `genres`: Categories or genres the movies belong to.
- `lengths`: Duration or length of the movies.
- `rating_stars`: Ratings received by the movies.
- `metascores`: Metascores assigned to the movies (if available).
- `votes`: Total votes accumulated by the movies.
- `grosses`: Box office gross earnings of the movies (if available).
- `directors`: Directors of the movies.
- `stars`: Lead actors/actresses in the movies.
- `descriptions`: Synopsis or descriptions of the movies.

### Data Collection Process:

- Identify the URL of the webpage containing the list of movies to be scraped.
- Use the `requests` library to send a GET request for each page of the IMDb list.
- Parse the HTML of each page with BeautifulSoup to extract information about the movies.
- Iterate through each movie to collect details such as title, release year, genre, runtime, rating, Metascore, votes, gross earnings, director, main cast, and description.
- Store the collected information in a DataFrame using the pandas library.
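
The parsing step above can be illustrated on a small inline HTML fragment, without any network access. The fragment below is a hypothetical, simplified stand-in for one IMDb list entry: only the class names (`lister-item-content`, `lister-item-year`, `genre`, `runtime`, `metascore`) are taken from this notebook's scraping function, while the sample markup and the helper name `parse_movie` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single list entry (not copied from IMDb).
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3><a href="/title/tt0000001/">Example Movie</a>
      <span class="lister-item-year">(1999)</span></h3>
  <p><span class="runtime">120 min</span>
     <span class="genre"> Drama, Mystery </span></p>
</div>
"""

def parse_movie(container):
    """Extract a few fields from one movie container, tolerating missing tags."""
    def text_of(tag_name, class_name):
        tag = container.find(tag_name, class_=class_name)
        return tag.text.strip() if tag else "N/A"

    runtime = text_of("span", "runtime")
    return {
        "name": container.find("h3").find("a").text.strip(),
        "year": text_of("span", "lister-item-year").strip("()"),
        "genre": text_of("span", "genre"),
        "length": runtime.split()[0] if runtime != "N/A" else "N/A",
        "metascore": text_of("span", "metascore"),
    }

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
movies = [parse_movie(div) for div in soup.find_all("div", class_="lister-item-content")]
print(movies[0])
```

Returning `"N/A"` for a missing tag, rather than letting `.text` fail on `None`, means one incomplete entry does not interrupt the rest of the page.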

In [2]:
def collect_data(base_url, num_movies, movies_per_page=100):
    # Initialize lists for storing data
    names = []
    years = []
    genres = []
    lengths = []
    rating_stars = []
    metascores = []
    votes = []
    grosses = []
    directors = []
    stars = []
    descriptions = []

    # Iterate over the required number of pages (ceiling division keeps a final partial page)
    for page in range(1, -(-num_movies // movies_per_page) + 1):
        try:
            # Construct the URL for the current page
            url = f"{base_url}&page={page}"
            
            # Send a GET request to the URL (the timeout avoids hanging indefinitely)
            response = requests.get(url, timeout=30)
            time.sleep(2)  # Pause between requests to avoid overloading the server

            # Check if the response status code is 200 (OK)
            if response.status_code == 200:
                # Parse the HTML content of the page
                soup = BeautifulSoup(response.text, 'html.parser')

                # Find all movie containers on the page
                movies = soup.find_all('div', class_='lister-item-content')

                # Process each movie
                for movie in movies:
                    # Extract movie details
                    name = movie.find('h3').find('a').text.strip()
                    year = movie.find('span', class_='lister-item-year').text.strip('()')
                    genre = movie.find('span', class_='genre').text.strip()
                    length = movie.find('span', class_='runtime').text.strip().split()[0]
                    rating = movie.find('span', class_='ipl-rating-star__rating').text.strip()

                    # Some movies might not have a metascore
                    metascore_tag = movie.find('span', class_='metascore')
                    metascore = metascore_tag.text.strip() if metascore_tag else 'N/A'

                    # Extract votes and gross, if available
                    nv_tags = movie.find_all('span', attrs={'name': 'nv'})
                    vote = nv_tags[0].text if nv_tags else 'N/A'
                    gross = nv_tags[1].text if len(nv_tags) > 1 else 'N/A'

                    # Extract director and stars
                    director, *star_list = movie.find_all('a', href=lambda href: href and 'name/nm' in href)
                    director = director.text
                    stars_str = ', '.join(star.text for star in star_list)

                    # Extract description
                    description = movie.find_all('p', class_='')[-1].text.strip()

                    # Append the extracted data to respective lists
                    names.append(name)
                    years.append(year)
                    genres.append(genre)
                    lengths.append(length)
                    rating_stars.append(rating)
                    metascores.append(metascore)
                    votes.append(vote)
                    grosses.append(gross)
                    directors.append(director)
                    stars.append(stars_str)
                    descriptions.append(description)

            else:
                print(f"Failed to process page {page}: Status code {response.status_code}")

        except requests.exceptions.RequestException as e:
            print(f"Request error on page {page}: {e}")
        except Exception as e:
            print(f"Error on page {page}: {e}")

    # Create a DataFrame with the collected data
    data = pd.DataFrame({
        'Name': names,
        'Year': years,
        'Genre': genres,
        'Length': lengths,
        'Rating': rating_stars,
        'Metascore': metascores,
        'Votes': votes,
        'Gross': grosses,
        'Director': directors,
        'Stars': stars,
        'Description': descriptions
    })

    return data
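
The page loop in `collect_data` depends on how many list pages are needed to cover `num_movies`. As a minimal sketch of that arithmetic, the hypothetical helper `page_urls` below (not part of the notebook) uses ceiling division so that a final, partially filled page is still requested, and builds each URL by appending `&page=N`. It assumes the base URL already contains a query string, as the one used in this notebook does; the `example.com` URL is a placeholder.

```python
import math

def page_urls(base_url, num_movies, movies_per_page=100):
    """Return the URLs of all list pages needed to cover num_movies entries."""
    num_pages = math.ceil(num_movies / movies_per_page)
    return [f"{base_url}&page={page}" for page in range(1, num_pages + 1)]

# Placeholder URL for illustration only.
base = "https://example.com/list?mode=detail"
print(len(page_urls(base, 1500)))  # 1500 entries at 100 per page
print(page_urls(base, 250)[-1])    # the last page holds the remaining 50 entries
```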

### Collecting Movie Data from IMDb

- Identify the URL of the webpage containing the list of movies to be scraped.

- Call the `collect_data` function with the list URL, the total number of movies to collect, and the number of movies shown per page.


In [3]:
# Specify the URL containing the list of movies
url = "https://www.imdb.com/list/ls051785783/?st_dt=&mode=detail&sort=list_order,asc"

# Scrape the data
data_film = collect_data(url, 1500, 100)
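
The sample output further down shows localized (Vietnamese) titles. A likely explanation, stated here as an assumption rather than something the notebook verifies, is that IMDb chooses the title language from the request's `Accept-Language` header or from the client's location. If English titles are preferred, one option is to send that header explicitly. The helper name `make_session` and the header value below are illustrative.

```python
import requests

def make_session(language="en-US,en;q=0.8"):
    """Create a requests session that asks the server for English-language pages."""
    session = requests.Session()
    session.headers.update({"Accept-Language": language})
    return session

session = make_session()
print(session.headers["Accept-Language"])

# A page could then be fetched with, for example:
# response = session.get(url, timeout=30)
```

Using this with `collect_data` would require changing the function to call `session.get` instead of `requests.get`.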

### Data Storage

- Store the collected data in a CSV file named `data_film.csv` using `data_film.to_csv()`.
- Read the CSV file back into `data_film` using `pd.read_csv()`.

In [4]:
# Save the DataFrame to data_film.csv without including the index
data_film.to_csv("data_film.csv", index=False)

# Read the CSV file back into data_film
data_film = pd.read_csv("data_film.csv")

# Display the 'data_film' DataFrame
data_film

|      | Name | Year | Genre | Length | Rating | Metascore | Votes | Gross | Director | Stars | Description |
|------|------|------|-------|--------|--------|-----------|-------|-------|----------|-------|-------------|
| 0 | Bố Già | 1972 | Crime, Drama | 175 | 9.2 | 100.0 | 1967180 | $134.97M | Francis Ford Coppola | Marlon Brando, Al Pacino, James Caan, Diane Ke... | Don Vito Corleone, head of a mafia family, dec... |
| 1 | Chuyện Tình Thế Chiến | 1942 | Drama, Romance, War | 102 | 8.5 | 100.0 | 595530 | $1.02M | Michael Curtiz | Humphrey Bogart, Ingrid Bergman, Paul Henreid,... | A cynical expatriate American cafe owner strug... |
| 2 | Sinh Viên Tốt Nghiệp | 1967 | Comedy, Drama, Romance | 106 | 8.0 | 83.0 | 284817 | $104.95M | Mike Nichols | Dustin Hoffman, Anne Bancroft, Katharine Ross,... | A disillusioned college graduate finds himself... |
| 3 | Công Dân Kane | 1941 | Drama, Mystery | 119 | 8.3 | 100.0 | 458887 | $1.59M | Orson Welles | Orson Welles, Joseph Cotten, Dorothy Comingore... | Following the death of publishing tycoon Charl... |
| 4 | 12 Người Đàn Ông Giận Dữ | 1957 | Crime, Drama | 96 | 9.0 | 97.0 | 841356 | $4.36M | Sidney Lumet | Henry Fonda, Lee J. Cobb, Martin Balsam, John ... | The jury in a New York City murder trial is fr... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1495 | Quái Thú Vô Hình | 1987 | Action, Adventure, Horror | 107 | 7.8 | 47.0 | 444578 | $59.74M | John McTiernan | Arnold Schwarzenegger, Carl Weathers, Kevin Pe... | A team of commandos on a mission in a Central ... |
| 1496 | Chuyến Du Lịch Châu Âu | 2004 | Comedy | 92 | 6.6 | 45.0 | 218869 | $17.72M | Jeff Schaffer | Alec Berg, David Mandel, Scott Mechlowicz, Jac... | Dumped by his girlfriend, a high school grad d... |
| 1497 | Champagne | 1928 | Comedy | 86 | 5.4 |  | 2572 |  | Alfred Hitchcock | Betty Balfour, Jean Bradin, Ferdinand von Alte... | A spoiled heiress defies her father by running... |
| 1498 | Thế Giới Không Đủ | 1999 | Action, Adventure, Thriller | 128 | 6.4 | 57.0 | 207600 | $126.94M | Michael Apted | Pierce Brosnan, Sophie Marceau, Robert Carlyle... | James Bond uncovers a nuclear plot while prote... |
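
The scraper stores the string `'N/A'` when a Metascore or gross figure is unavailable, yet the output above shows those fields as missing (for example, for *Champagne*) and shows Metascore as a decimal number. This is consistent with `pd.read_csv`, which by default treats strings such as `N/A` as missing values. The sketch below reproduces the effect on a tiny, made-up DataFrame, using an in-memory buffer instead of `data_film.csv`; the sample rows are illustrative, not scraped data.

```python
import io
import pandas as pd

# Made-up rows that mimic the scraper's output (all values are strings).
sample = pd.DataFrame({
    "Name": ["Movie A", "Movie B"],
    "Metascore": ["100", "N/A"],
    "Gross": ["$1.02M", "N/A"],
})

# Round-trip through CSV text, as the notebook does through data_film.csv.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.dtypes)        # Metascore is parsed as a floating-point column
print(restored.isna().sum())  # count of missing values per column
```

If the literal `N/A` text should be preserved instead, `pd.read_csv` accepts `keep_default_na=False`.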


## Data Preprocessing and Exploration

## Data Modeling

## References