### Course: Foundation of Information
### Project 1 - Part B TMDB Problem
### Student: Muhammad Kamran Syed

### Problem Statement:

<p>
A common business requirement in the context of information gathering is to extract and filter relevant
data from web pages that host this information. However, access to information spread over several
web pages, hosted potentially on multiple websites is a cumbersome process and we cannot rely on
manual procedures to execute this task. In this project, you will employ a programmatic approach to
access, parse and extract relevant information from a website of interest.
</p>

### Objective:

<p>
The project's goal is to extract data (from a chosen number of pages) from The Movie Database website
(https://www.themoviedb.org/) into a tabular data format so that further analysis (e.g., details about a
movie's genre, cast, and user rating) can be facilitated.
To execute this project, you will have to read the documentation links provided against each task in the
assignment and adapt the code examples provided in the documentation for the task at hand.
</p>

### 0. Preparation:

Install the required libraries conditionally, that is, only when they are not already present (my own addition).

In [8]:
import subprocess
import sys

def pip_install_if_not_present(packageName, importName=None):
    """
    Install a package using pip if it's not already installed.

    :param packageName: The package name to install
    :param importName: The import name, if it differs from the package name (e.g. package beautifulsoup4 is imported as bs4)
    :return: True if the package is already installed or was installed successfully, False otherwise
    """
    if importName is None:
        importName = packageName
    try:
        # Try to import the packages, if it is imported it is already installed
        __import__(importName)
        print(f"package [{packageName}] is already installed")
        return True
    except ImportError as e:
        print(f"package [{packageName}] is not installed: [{e}]")
    
    print(f"Installing package [{packageName}] now...")
    try:
        # run the pip install to install the package
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", packageName])
        # If installed correctly, it should be imported.
        __import__(importName)
    except subprocess.CalledProcessError as e:
        print(f"Error installing package [{packageName}]: {e}")
        return False
    except ImportError:
        print(f"Package [{packageName}] could not be imported after installation.")
        return False
    print(f"package [{packageName}] is installed")
    return True

In [9]:
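# Install each dependency only if it cannot be imported yet.
# beautifulsoup4 is imported under the name bs4, hence the separate import name.
for package, import_name in [('requests', None), ('beautifulsoup4', 'bs4'), ('pandas', None)]:
    pip_install_if_not_present(packageName=package, importName=import_name)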

package [requests] is already installed
package [beautifulsoup4] is already installed
package [pandas] is already installed
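
The helper above detects a package by importing it, which runs the package's top-level code. As a lighter alternative, the sketch below asks the import system whether a top-level module can be located without importing it. The name `is_installed` is a hypothetical helper and not part of the original notebook.

```python
import importlib.util

def is_installed(import_name):
    """Return True if a top-level module can be located, without importing it."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(import_name) is not None

print(is_installed("json"))                          # a standard-library module
print(is_installed("surely_not_a_real_module_xyz"))  # a name that should not exist
```

The check only answers whether the module is present; installing a missing package would still go through pip as in the function above.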


In [10]:
#For pretty printing, this is built in python standard library, no need to install
from pprint import pprint

### 1. Establish a connection to webpage [TheMovieDb](https://www.themoviedb.org/movie)

In [12]:
# define some Constants, conventionally we represent Constants as all caps
# URL to scrape
URL = 'https://www.themoviedb.org/movie'

# User-Agent header to avoid a 403 Forbidden error
# I am in the Middle East region (KSA), so the expected response language is set to English;
# otherwise an Arabic response (title, description) would be returned by default
NEEDED_HEADERS = {
    'User-Agent':"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    'Accept-Language': "en-US,en;q=0.9"
}

##### 1.a. Import requests library and formulate a get request to download the contents of the webpage:
("https://www.themoviedb.org/movie")

In [14]:
import requests

response = requests.get(url=URL, headers=NEEDED_HEADERS)
response

<Response [200]>

##### 1.b. Verify the status code of the request and confirm that the request was executed appropriately

In [16]:
# A requests Response is truthy only when its status code is below 400,
# so a falsy response still carries a status code. Checking status_code
# directly covers both the success and the failure case.
if response.status_code == 200:
    print(f"Request was successful with status code [{response.status_code}] to [{URL}]")
else:
    print(f"Request failed with status code [{response.status_code}] to [{URL}]")

Request was successful with status code [200] to [https://www.themoviedb.org/movie]
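
To make a status check easier to read, a code can be turned into a short description. The sketch below uses the standard-library `http.HTTPStatus` enum for the reason phrase and groups codes by their hundreds range. `describe_status` is a hypothetical helper introduced only for illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a label such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus have no known phrase.
        phrase = "Unknown"
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"

for code in (200, 403, 404, 500):
    print(describe_status(code))
```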


##### 1.c. Print the contents of the page obtained from the response and save it in a variable

In [18]:
page_content = response.content
pprint(page_content)

(b'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popul'
 b'ar Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv'
 b'="cleartype" content="on">\n    <meta charset="utf-8">\n    <meta name="ke'
 b'ywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresse'
 b's, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n  '
 b'  <meta name="mobile-web-app-capable" content="yes">\n    <meta name="app'
 b'le-mobile-web-app-capable" content="yes">\n    <meta name="viewport" cont'
 b'ent="width=device-width,initial-scale=1">\n      <meta name="description"'
 b' content="The Movie Database (TMDB) is a popular, user editable database for'
 b' movies and TV shows.">\n    <meta name="msapplication-TileImage" content'
 b'="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f7'
 b'5d26dab1224173a96fecc962.png">\n<meta name="msapplication-TileColor" cont'
 b'ent="#032541">\n<meta name="theme-color" content=

##### 1.d. Infer the type of the variable created in part 1c and display the first 200 characters of the content from the server’s response

In [20]:
print(f"Type of content variable is [{type(page_content)}]")
print(f"Printing top 200 characters from the content using string slicing")
pprint(page_content[:200])

Type of content variable is [<class 'bytes'>]
Printing top 200 characters from the content using string slicing
(b'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popul'
 b'ar Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv'
 b'="cleartype" content="on">\n    <meta charset="utf-8">\n  ')
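
Because `response.content` is a `bytes` object, the slice above returns the first 200 bytes, which is not always the same as 200 characters. The sketch below illustrates the difference on a small made-up sample; the sample string is an assumption chosen to contain two non-ASCII characters.

```python
# A made-up sample with two non-ASCII characters, encoded the way a server would send it.
sample = "Am\u00e9lie \u2014 TMDB".encode("utf-8")

print(type(sample).__name__)    # the raw payload is bytes
print(len(sample))              # counts bytes

text = sample.decode("utf-8")   # decoding yields a str
print(type(text).__name__)
print(len(text))                # counts characters
```

In requests, `response.text` provides the decoded string directly, so slicing it counts characters rather than bytes.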


#### 2. Parse the content of HTML response using the BeautifulSoup library and execute the tasks

##### 2.a. From the BeautifulSoup library import the BeautifulSoup class and create an instance of the BeautifulSoup class

In [23]:
from bs4 import BeautifulSoup

# Create the soup object using Python's built-in html.parser
soup = BeautifulSoup(page_content, 'html.parser')


##### 2.b. Extract the title of the parsed web page content

In [25]:
soup.title.string

'Popular Movies — The Movie Database (TMDB)'
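
The raw page source writes the dash in the title as the character reference `&#8212;`, while the parsed title shows the actual character. The parser performs this conversion. A minimal sketch of the same step with the standard library, using the title text from the raw output in section 1.c:

```python
import html

raw_title = "Popular Movies &#8212; The Movie Database (TMDB)"

# html.unescape converts character references into the characters they stand for.
decoded_title = html.unescape(raw_title)
print(decoded_title)
```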

##### 2.c. Write a user defined function to generalize the task presented in Q2a to any URL 

<p>
Write a user defined function to generalize the task presented in Q2a to any URL that
retrieves the content of the webpage. Your function should take a URL string as an input
and return a correctly formulated BeautifulSoup instance as the output. In your
function definition, ensure that appropriate exceptions are raised to the user (through
status codes) if they pass in malformed/incorrect URLs. Write two test cases for your
function - one with a working URL and another with a URL that gets a 404 response.
(3 marks)

<quote>Note: While pulling the request please specify a user agent string as shown in Q.1.</quote>
</p>

In [27]:
def get_soup_object(url, headers=None):
    """
    Fetch a URL and return the page parsed as a BeautifulSoup object.
    The caller is expected to have imported the BeautifulSoup and requests libraries already.
    The page is parsed with the built-in html.parser.
    
    Args:
    url (str): The URL of the web page.
    headers (dict of str to str): Headers to send with the request. Optional; when None is passed,
    default User-Agent and Accept-Language headers are used.

    Returns:
    soup (BeautifulSoup object): The BeautifulSoup object of the web page.

    Raises:
    Exception: If the request fails or the status code is not 200.

    Example:
    >>> get_soup_object('https://www.test.com')
    """

    # Set the headers conditionally, if no headers are passed
    # These headers are local to the get_soup_object function 
    if headers is None:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9"
        }

    try:
        # Send a GET request to the URL with User-Agent and Accept Language Headers
        res = requests.get(url=url, headers=headers)

        if res.status_code != 200:
            raise Exception (
                f"Request failed for [{url}] with StatusCode:[{res.status_code}] and Reason:[{res.reason}]"
            )                   
        return BeautifulSoup(res.content, 'html.parser')    
    except requests.exceptions.RequestException as e:
        raise Exception(f"Request error occoured for [{url}] with error message [{e}]")

##### 2.c.i. Test the function get_soup_object(url) with the test cases below
* Valid URL
* Non-existent URL
* Forbidden URL
* Malformed URL
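
The malformed case can also be caught before any request is sent. The sketch below is a rough pre-check with the standard library that accepts only URLs with an http or https scheme and a host part. `looks_like_http_url` is a hypothetical helper, and passing this check does not guarantee that the URL exists.

```python
from urllib.parse import urlparse

def looks_like_http_url(url):
    """Return True when the URL has an http(s) scheme and a host part."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

for candidate in ("https://www.themoviedb.org/movie", "htt://google.cm", "www.example.com"):
    print(candidate, "->", looks_like_http_url(candidate))
```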

In [29]:
# Test with a working URL that gives 200
try:
    correct_url = 'https://www.examples.com'
    pprint(f"Testing with valid URL: {correct_url}")
    soup = get_soup_object(correct_url)
    pprint(f"Success: Page title = {soup.title.string}")
except Exception as e1:
    pprint(f"Error occurred: {e1}")

# Test with a non-existent URL that gives 404
try:
    not_existing_url = 'https://www.examples.com/DoNotExists'
    pprint(f"Testing with non-existing URL: {not_existing_url}")
    soup = get_soup_object(not_existing_url)
    pprint(f"Success: Page title = {soup.title.string}")
except Exception as e2:
    pprint(f"Error occurred: {e2}")

# Test with a URL expected to give 403 (in the run below the site answered with 200)
try:
    forbidden_url = 'https://www.amazon.com'
    pprint(f"Testing with forbidden URL: {forbidden_url}")
    soup = get_soup_object(forbidden_url)
    pprint(f"Success: Page title = {soup.title.string}")
except Exception as e3:
    pprint(f"Error occurred: {e3}")

# Test with a malformed URL
try:
    malformed_url = 'htt://google.cm'
    pprint(f"Testing with Malformed URL: {malformed_url}")
    soup = get_soup_object(malformed_url)
    pprint(f"Success: Page title = {soup.title.string}")
except Exception as e4:
    pprint(f"Error occurred: {e4}")

'Testing with valid URL: https://www.examples.com'
'Success: Page title = Examples - Free Interactive Resources'
'Testing with non-existing URL: https://www.examples.com/DoNotExists'
('Error occurred: Request failed for [https://www.examples.com/DoNotExists] '
 'with StatusCode:[404] and Reason:[Not Found]')
'Testing with forbidden URL: https://www.amazon.com'
'Success: Page title = Amazon.com. Spend less. Smile more.'
'Testing with Malformed URL: htt://google.cm'
('Error occurred: Request error occoured for [htt://google.cm] with error '
 "message [No connection adapters were found for 'htt://google.cm']")


#### 3. Extract the content of the webpage that hosts a current dated listing of popular movies

##### 3.a. Write a function call to the user defined function created in 2c with the url (https://www.themoviedb.org/movie) as an input and store the response in a variable

In [32]:
try:
    # URL and NEEDED_HEADERS were defined at the start of the notebook
    soup_tmdb = get_soup_object(url=URL, headers=NEEDED_HEADERS)
    pprint(soup_tmdb.prettify())

    # I'll persist the html to disk in order to inspect the local copy in a browser developer tool
    # and find out specific tags classes to extract in the subsequent sections.
    #with open(file='tmdb_movie_page.html', mode='w', encoding='utf-8') as file:
    #    file.write(soup_tmdb.prettify())
except Exception as e:
    pprint(f"Error occurred: {e}")

('<!DOCTYPE html>\n'
 '<html class="no-js" lang="en">\n'
 ' <head>\n'
 '  <title>\n'
 '   Popular Movies — The Movie Database (TMDB)\n'
 '  </title>\n'
 '  <meta content="on" http-equiv="cleartype"/>\n'
 '  <meta charset="utf-8"/>\n'
 '  <meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, '
 'Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" '
 'name="keywords"/>\n'
 '  <meta content="yes" name="mobile-web-app-capable"/>\n'
 '  <meta content="yes" name="apple-mobile-web-app-capable"/>\n'
 '  <meta content="width=device-width,initial-scale=1" name="viewport"/>\n'
 '  <meta content="The Movie Database (TMDB) is a popular, user editable '
 'database for movies and TV shows." name="description"/>\n'
 '  <meta '
 'content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" '
 'name="msapplication-TileImage"/>\n'
 '  <meta content="#032541" name="msapplication-TileColor"/>\n'
 '  <meta content="#032

##### 3.b Print the HTML content associated with the first movie displayed on the web page using appropriate HTML tags to access this listing on the object created in part 3a

Inspecting the HTML shows that all movie details sit inside a wrapping div with the CSS class page_wrapper. Within it, each movie is enclosed in an inner div whose class attribute is "card style_1". Passing that exact class string to the find method returns the first matching tag, which contains all the HTML for the first listed movie.

In [35]:
if soup_tmdb:
    first_movie_html = soup_tmdb.find('div', class_='card style_1')
    pprint(first_movie_html.prettify())
else:
    print('HTML source for movie details not found')

('<div class="card style_1">\n'
 ' <div class="image">\n'
 '  <div class="wrapper glyphicons_v2 picture grey no_image_holder">\n'
 '   <a class="image" href="/movie/533535-deadpool-wolverine" title="Deadpool '
 '&amp; Wolverine">\n'
 '    <img alt="Deadpool &amp; Wolverine" class="poster w-full" loading="lazy" '
 'src="https://media.themoviedb.org/t/p/w220_and_h330_face/8cdWjvZQUExUUTzyp4t6EDMubfO.jpg" '
 'srcset="https://media.themoviedb.org/t/p/w220_and_h330_face/8cdWjvZQUExUUTzyp4t6EDMubfO.jpg '
 '1x, '
 'https://media.themoviedb.org/t/p/w440_and_h660_face/8cdWjvZQUExUUTzyp4t6EDMubfO.jpg '
 '2x"/>\n'
 '   </a>\n'
 '  </div>\n'
 '  <div class="options" data-id="533535" data-media-type="movie" '
 'data-object-id="5b3b6b8ec3a3684b8900d8fd">\n'
 '   <a aria-label="View Item Options" class="no_click" href="#">\n'
 '    <div class="glyphicons_v2 circle-more white">\n'
 '    </div>\n'
 '   </a>\n'
 '  </div>\n'
 ' </div>\n'
 ' <div class="content">\n'
 '  <div class="consensus tight">\n'
 

##### 3.c. Display the name of the first movie using appropriate HTML tags to access this listing on the object created in part 3a

<small> Note: The task states that the object to be used is the one created in 3a, so we use the soup_tmdb object directly. The first_movie_html object parsed in 3b could be used instead.
</small>

In [37]:
if soup_tmdb:
    first_movie = soup_tmdb.find('div', class_='card style_1')
    if first_movie:
        title = first_movie.find('h2').get_text(strip=True)
        print('Movie Title =', title)
else:
    print('Movie title not found')

Movie Title = Deadpool & Wolverine


##### 3.d. Display the user rating of the first movie by using appropriate HTML tags to access this listing on the object created in part 3a

<p>
Instead of a user rating, the current TMDB page shows a user score chart as a percentage. We will extract this value and display it.</p> 

<small> Note: The task states that the object to be used is the one created in 3a, so we use the soup_tmdb object directly. The first_movie_html object parsed in 3b could be used instead.
</small>

In [40]:
if soup_tmdb:
    first_movie = soup_tmdb.find('div', class_=['card style_1'])
    if first_movie:
        score_chart = first_movie.find('div', class_='user_score_chart')
        if score_chart:
            score = score_chart.get('data-percent', 'not rated')
            print('User Score Percentage =', f"{score}%")
        else:
            print('not rated')

User Score Percentage = 78%


##### 3.e. For the first movie, extract the part of the url following the string (https://www.themoviedb.org/) using the appropriate HTML tags to extract this portion on the object created in part 3a (do not use built-in string methods).

<p>For example, if the first movie on the web page had the URL (https://www.themoviedb.org/movie/779782) your output should be movie/779782</p>

<small> Note: The task states that the object to be used is the one created in 3a, so we use the soup_tmdb object directly. The first_movie_html object parsed in 3b could be used instead.
</small>

The href attribute of the movie's anchor tag already holds a relative URL, so nothing has to be cut out of a full URL. We only use string slicing to remove the leading / from the relative URL.

```
<a class="image" href="/movie/533535-deadpool-wolverine" title="Deadpool &amp; Wolverine" /> 
```

In [43]:
if soup_tmdb:
    first_movie = soup_tmdb.find('div', class_=['card style_1'])
    if first_movie:
        relative_url = first_movie.find('a', class_='image')['href'][1:]
        print(relative_url)
    else:
        print('Relative URL for movie not found')

movie/533535-deadpool-wolverine
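
The relationship between the relative and the absolute URL can be sketched with `urllib.parse` from the standard library. `to_absolute` and `to_relative` are hypothetical helper names. The example URLs are the ones quoted in this section.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.themoviedb.org/"

def to_absolute(relative_url):
    """Build the full details URL from a relative one such as 'movie/779782'."""
    return urljoin(BASE_URL, relative_url)

def to_relative(absolute_url):
    """Return the path of a full URL without its leading slash."""
    # urlparse(...).path is '/movie/779782'; slicing drops the first character.
    return urlparse(absolute_url).path[1:]

print(to_absolute("movie/533535-deadpool-wolverine"))
print(to_relative("https://www.themoviedb.org/movie/779782"))
```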


#### 4. Write user defined functions for each subsection below (i.e., Q4 a, Q4b, Q4c, Q4d, and Q4e) to return

##### 4.a. Titles of all the movies on the page as a Python list

<small>Note: for Q4a, Q4b, and Q4c, use the response object created in Q3a</small>

In [46]:
def get_movie_titles(obj_soup):
    """
    Extracts movie titles from a TMDB BeautifulSoup object.
    :param obj_soup: A BeautifulSoup object of the page
    :return: A list of movie titles
    """
    titles = []
    #get all movie cards
    movies = obj_soup.find_all('div', class_=['card style_1'])

    titles = [movie.find('h2').get_text(strip=True) for movie in movies]
    return titles

##### We have our soup object in soup_tmdb variable which we will use here

In [48]:
titles = get_movie_titles(soup_tmdb)
pprint(titles)

['Deadpool & Wolverine',
 'Twisters',
 'Despicable Me 4',
 'Inside Out 2',
 'Bad Boys: Ride or Die',
 'Kill',
 'A Quiet Place: Day One',
 'Mayhem!',
 'Alien: Romulus',
 'It Ends with Us',
 'Prey',
 'Dragonkeeper',
 'Saving Bikini Bottom: The Sandy Cheeks Movie',
 'The Instigators',
 'One Fast Move',
 'The Garfield Movie',
 'Furiosa: A Mad Max Saga',
 'Kingdom of the Planet of the Apes',
 'The Convert',
 'Breaking and Re-entering']


##### 4.b. User ratings of all the movies on the page as a Python list

<small>Note: for Q4a, Q4b, and Q4c, use the response object created in Q3a</small>

In [50]:
def get_user_ratings(obj_soup):
    """
    Extracts user ratings from the movie cards in a soup object of the TMDB movies page.
    :param obj_soup: The soup object of the TMDB movies page
    :return: A list of user ratings
    """
    ratings = []
    #get all movie cards
    movies = obj_soup.find_all('div', class_=['card style_1'])

    for movie in movies:
        score_chart = movie.find('div', class_='user_score_chart')
        if score_chart:
            score = score_chart.get('data-percent', 'not rated')
        else:
            score = 'not rated'
        ratings.append(score)
    return ratings

In [51]:
user_ratings = get_user_ratings(soup_tmdb)
pprint(user_ratings)

['78',
 '71',
 '73',
 '76',
 '76',
 '70',
 '69',
 '67',
 '74',
 '70',
 '65',
 '69',
 '65',
 '65',
 '67',
 '72',
 '76',
 '72',
 '63',
 '64']
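
The ratings come back as strings, with 'not rated' as the fallback, so they cannot be averaged directly. The sketch below converts them to integers and maps anything non-numeric to `None`. `parse_rating` is a hypothetical helper, and the sample list is made up; it assumes scores are whole-number strings like the ones printed above.

```python
def parse_rating(value):
    """Convert a scraped score such as '78' to int; return None if it is not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

scraped = ["78", "71", "not rated", "64"]   # made-up sample in the same shape
numeric = [parse_rating(v) for v in scraped]
print(numeric)

# Average only the movies that actually have a score.
rated = [n for n in numeric if n is not None]
print(sum(rated) / len(rated))
```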


##### 4.c. HTML content of all the individual pages of movies collected into a Python list.

<small>Note: for Q4a, Q4b, and Q4c, use the response object created in Q3a</small>

The task asks for the HTML of all the movies in a Python list, while the FAQ clarified that the relative URLs should be collected. Since the exact requirement is unclear, I collect both.

In [54]:
def get_movies_html(obj_soup):
    """
    Extracts the HTML of every movie card from a BeautifulSoup object of the TMDB movies page.
    :param obj_soup: The soup object representation of the TMDB movies page
    :return: A list of bs4 Tag objects, one per movie card
    """
    # get all movie cards and copy the result set into a plain list
    return list(obj_soup.find_all('div', class_=['card style_1']))

def get_movies_details_url(obj_soup):
    """
    Extracts the relative URL of each movie details page from a BeautifulSoup object of the TMDB movies page
    :param obj_soup: The soup object representation of the TMDB movies page
    :return: A list of relative URLs of the movie details pages, without the leading slash
    """
    # get all movie cards
    movies = obj_soup.find_all('div', class_=['card style_1'])
    # [1:] drops the leading / from each href
    return [movie.find('a', class_='image')['href'][1:] for movie in movies]

In [55]:
movies_html = get_movies_html(soup_tmdb)
pprint(movies_html)

[<div class="card style_1">
<div class="image">
<div class="wrapper glyphicons_v2 picture grey no_image_holder">
<a class="image" href="/movie/533535-deadpool-wolverine" title="Deadpool &amp; Wolverine">
<img alt="Deadpool &amp; Wolverine" class="poster w-full" loading="lazy" src="https://media.themoviedb.org/t/p/w220_and_h330_face/8cdWjvZQUExUUTzyp4t6EDMubfO.jpg" srcset="https://media.themoviedb.org/t/p/w220_and_h330_face/8cdWjvZQUExUUTzyp4t6EDMubfO.jpg 1x, https://media.themoviedb.org/t/p/w440_and_h660_face/8cdWjvZQUExUUTzyp4t6EDMubfO.jpg 2x"/>
</a>
</div>
<div class="options" data-id="533535" data-media-type="movie" data-object-id="5b3b6b8ec3a3684b8900d8fd">
<a aria-label="View Item Options" class="no_click" href="#"><div class="glyphicons_v2 circle-more white"></div></a>
</div>
</div>
<div class="content">
<div class="consensus tight">
<div class="outer_ring">
<div class="user_score_chart 5b3b6b8ec3a3684b8900d8fd" data-bar-color="#21d07a" data-percent="78" data-track-color="#204529

In [56]:
movies_url_list = get_movies_details_url(soup_tmdb)
pprint(movies_url_list)

['movie/533535-deadpool-wolverine',
 'movie/718821-twisters',
 'movie/519182-despicable-me-4',
 'movie/1022789-inside-out-2',
 'movie/573435-bad-boys-ride-or-die',
 'movie/1160018-kill',
 'movie/762441-a-quiet-place-day-one',
 'movie/959092-farang',
 'movie/945961-alien-romulus',
 'movie/1079091-it-ends-with-us',
 'movie/1129598-prey',
 'movie/588648-dragonkeeper',
 'movie/831815-saving-bikini-bottom-the-sandy-cheeks-movie',
 'movie/1059064-the-instigators',
 'movie/1281826-one-fast-move',
 'movie/748783-the-garfield-movie',
 'movie/786892-furiosa-a-mad-max-saga',
 'movie/653346-kingdom-of-the-planet-of-the-apes',
 'movie/1066262-the-convert',
 'movie/1166073']
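
Each relative URL starts with the numeric TMDB id of the movie, optionally followed by a slug, as the list above shows. A minimal sketch of pulling that id out with a regular expression follows. `extract_movie_id` is a hypothetical helper, and the pattern assumes the `movie/<digits>` shape seen in this output.

```python
import re

# Matches 'movie/' followed by one or more digits at the start of the string.
MOVIE_ID_PATTERN = re.compile(r"^movie/(\d+)")

def extract_movie_id(relative_url):
    """Return the numeric id from a relative movie URL, or None if it does not match."""
    match = MOVIE_ID_PATTERN.match(relative_url)
    return int(match.group(1)) if match else None

print(extract_movie_id("movie/533535-deadpool-wolverine"))
print(extract_movie_id("movie/1166073"))
```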


##### 4.d. Genres of all the movies on the page as a Python list

<small>For Q4d and Q4e, use the list output from Q4c</small>

<p>
    Each movie has a list of genres, so we pass the list of relative movie URLs to the function, and it returns a list of lists holding the genres of the movie behind each URL.

    Input: List of relative URLs for movies details page
    Output: List of genres for each movie in the input list
</p>

In [59]:
def get_movies_genere(movie_details_url, debug=True):
    """
    Takes a list of relative movie details URLs and returns the genres of each movie.
    Args:
    movie_details_url (list of str): Relative URLs of the movie details pages.
    debug (bool): If True, print each URL as it is fetched. Default is True
    Returns: A list of genre lists, one per URL in the input list
    """
    movie_genere = []
    for rel_url in movie_details_url:
        # Construct the detailed movie URL
        details_url = f"https://www.themoviedb.org/{rel_url}"
        if debug: print(f"Fetching generes for [{details_url}]")
        # Get the soup object for the detailed page
        soup = get_soup_object(details_url)
        # Start from an empty list so every URL contributes exactly one entry
        g = []
        generes = soup.find('span', class_='genres') if soup is not None else None
        if generes is not None:
            g = [a.get_text(strip=True) for a in generes.find_all('a')]
        movie_genere.append(g)
    return movie_genere

In [60]:
movie_genere = get_movies_genere(get_movies_details_url(soup_tmdb))
pprint(movie_genere)

Fetching generes for [https://www.themoviedb.org/movie/533535-deadpool-wolverine]
Fetching generes for [https://www.themoviedb.org/movie/718821-twisters]
Fetching generes for [https://www.themoviedb.org/movie/519182-despicable-me-4]
Fetching generes for [https://www.themoviedb.org/movie/1022789-inside-out-2]
Fetching generes for [https://www.themoviedb.org/movie/573435-bad-boys-ride-or-die]
Fetching generes for [https://www.themoviedb.org/movie/1160018-kill]
Fetching generes for [https://www.themoviedb.org/movie/762441-a-quiet-place-day-one]
Fetching generes for [https://www.themoviedb.org/movie/959092-farang]
Fetching generes for [https://www.themoviedb.org/movie/945961-alien-romulus]
Fetching generes for [https://www.themoviedb.org/movie/1079091-it-ends-with-us]
Fetching generes for [https://www.themoviedb.org/movie/1129598-prey]
Fetching generes for [https://www.themoviedb.org/movie/588648-dragonkeeper]
Fetching generes for [https://www.themoviedb.org/movie/831815-saving-bikini-bott

##### 4.e. Cast of all the movies on the page as a Python list

<small>For Q4d and Q4e, use the list output from Q4c</small>

<p>
    Each cast member has two details, the actor and the character. We pass a list of relative movie URLs, and the function returns one list per movie whose entries are strings in the format Actor, as Character

    Input: List of relative URLs for movies details page
    Output: One list per movie in the input list, holding a string for each cast member
</p>

In [63]:
def get_movies_cast(movie_details_url, debug=True):
    """
    Get the cast of every movie in the movie details URL list.
    Args:
    movie_details_url (list): List of relative URLs of movie details pages.
    debug (bool): If True, print debug messages. Default is True.
    Returns: One list per movie with entries in "Actor, as Character" format
    """
    movies_actor_list = []
    for rel_url in movie_details_url:
        # Construct the detailed movie URL
        details_url = f"https://www.themoviedb.org/{rel_url}"
        if debug: print(f"Fetching cast for [{details_url}]")
        # Get the soup object for the detailed page
        soup = get_soup_object(details_url)
        actor_list = []
        casts = soup.find_all('li', class_='card') if soup is not None else []
        for cst in casts:
            # Skip cards that lack the expected actor link instead of failing
            name_tag = cst.find('p')
            actor_tag = name_tag.find('a') if name_tag is not None else None
            if actor_tag is None:
                continue
            actor = actor_tag.get_text(strip=True)
            character_tag = cst.find('p', class_=['character'])
            character = character_tag.get_text(strip=True) if character_tag is not None else ''
            actor_list.append(f"{actor}, as {character}" if character else f"{actor},")
        # Append once per URL so the result stays aligned with the input list
        movies_actor_list.append(actor_list)
    return movies_actor_list

In [64]:
movie_actor_list = get_movies_cast(get_movies_details_url(soup_tmdb)) 
pprint(movie_actor_list)

Fetching cast for [https://www.themoviedb.org/movie/533535-deadpool-wolverine]
Fetching cast for [https://www.themoviedb.org/movie/718821-twisters]
Fetching cast for [https://www.themoviedb.org/movie/519182-despicable-me-4]
Fetching cast for [https://www.themoviedb.org/movie/1022789-inside-out-2]
Fetching cast for [https://www.themoviedb.org/movie/573435-bad-boys-ride-or-die]
Fetching cast for [https://www.themoviedb.org/movie/1160018-kill]
Fetching cast for [https://www.themoviedb.org/movie/762441-a-quiet-place-day-one]
Fetching cast for [https://www.themoviedb.org/movie/959092-farang]
Fetching cast for [https://www.themoviedb.org/movie/945961-alien-romulus]
Fetching cast for [https://www.themoviedb.org/movie/1079091-it-ends-with-us]
Fetching cast for [https://www.themoviedb.org/movie/1129598-prey]
Fetching cast for [https://www.themoviedb.org/movie/588648-dragonkeeper]
Fetching cast for [https://www.themoviedb.org/movie/831815-saving-bikini-bottom-the-sandy-cheeks-movie]
Fetching cas
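
Both get_movies_genere and get_movies_cast call get_soup_object for every details URL, so each details page is downloaded twice when both lists are built. One possible remedy is to cache the result per URL. The sketch below shows the idea with a stand-in fetch function so that it runs offline. `make_cached_fetcher` and `fake_fetch` are hypothetical names; in the notebook, get_soup_object could be passed in place of `fake_fetch`.

```python
def make_cached_fetcher(fetch):
    """Wrap fetch(url) so that each distinct URL is fetched at most once."""
    cache = {}

    def cached_fetch(url):
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return cached_fetch

# Stand-in for a real download; it records every call it receives.
calls = []

def fake_fetch(url):
    calls.append(url)
    return f"<html>{url}</html>"

cached = make_cached_fetcher(fake_fetch)
cached("movie/1")   # first request reaches fake_fetch
cached("movie/1")   # repeated request is served from the cache
cached("movie/2")
print(len(calls))
```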

#### 5. Write a user-defined function that returns a pandas DataFrame with the following data:

* 5.a. Titles of the movies listed on the page
* 5.b. User ratings of the movies listed on the page
* 5.c. Genres of the movies listed on the page
* 5.d. Cast of the movies listed on the page

In [67]:
import pandas as pd

def create_movies_dataframe(movies_page_url, headers=None, debug=False):
    """
    Creates a pandas DataFrame from the TMDB movies listing page at the given URL.
    The function reuses the existing functions for fetching titles, ratings, genres and casts
    and returns a DataFrame with the fetched data.
    Args:
        movies_page_url (str): The URL of the movies listing page.
        headers (dict): A dictionary of headers to be used in the request. Defaults to None.
        debug (bool): A flag to enable debug mode. Defaults to False.
    Returns: A DataFrame with Title, User Rating %, Genres and Cast for all the movies on the page
    """
    # Initialize lists to store the data
    titles = []
    user_ratings = []
    genres = []
    actors = []

    # First lets get the soup object for the page
    soup = get_soup_object(movies_page_url, headers)
    if debug: pprint(soup)
    # Second get the titles
    titles = get_movie_titles(soup)
    if debug: pprint(titles)
    
    # Third get the user ratings
    user_ratings = get_user_ratings(soup)
    if debug: pprint(user_ratings)
    # Fourth get the details URL
    details_urls = get_movies_details_url(soup)
    if debug: pprint(details_urls)
    # fifth get the genres
    genres = get_movies_genere(details_urls, debug=debug)
    if debug: pprint(genres)
    # sixth get the casts
    actors = get_movies_cast(details_urls, debug=debug)
    if debug: pprint(actors)
    
    # Create a DataFrame from the lists
    df = pd.DataFrame({
        'Title': titles,
        'User Rating %': user_ratings,
        'Genres': genres,
        'Cast': actors
    })
    
    return df

In [68]:
df = create_movies_dataframe(URL)
pprint(df.head(5))

                   Title User Rating %  \
0   Deadpool & Wolverine            78   
1               Twisters            71   
2        Despicable Me 4            73   
3           Inside Out 2            76   
4  Bad Boys: Ride or Die            76   

                                   Genres  \
0       [Action, Comedy, Science Fiction]   
1    [Action, Adventure, Drama, Thriller]   
2     [Animation, Family, Comedy, Action]   
3  [Animation, Family, Adventure, Comedy]   
4       [Action, Crime, Thriller, Comedy]   

                                                Cast  
0  [Ryan Reynolds, as Wade Wilson / Deadpool / Ni...  
1  [Daisy Edgar-Jones, as Kate, Glen Powell, as T...  
2  [Steve Carell, as Gru (voice), Kristen Wiig, a...  
3  [Amy Poehler, as Joy (voice), Maya Hawke, as A...  
4  [Will Smith, as Mike Lowrey, Martin Lawrence, ...  


In [69]:
# Write the Dataframe to a CSV file
df.to_csv('tmdb_movies.stats.csv', encoding='utf-8', index=False)
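
The Genres and Cast columns hold Python lists. To my knowledge, such cells are written to CSV as their string representation, so they come back as plain strings when the file is read again. The sketch below reproduces that round trip with the standard-library `csv` module as a stand-in for pandas and restores the list with `ast.literal_eval`. The single sample row is taken from the output above.

```python
import ast
import csv
import io

rows = [{"Title": "Twisters", "Genres": ["Action", "Adventure", "Drama", "Thriller"]}]

# Write to an in-memory buffer; the list is stored as its string representation.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Title", "Genres"])
writer.writeheader()
for row in rows:
    writer.writerow({"Title": row["Title"], "Genres": str(row["Genres"])})

# Read it back; literal_eval safely turns the text into a list again.
buffer.seek(0)
restored = []
for record in csv.DictReader(buffer):
    record["Genres"] = ast.literal_eval(record["Genres"])
    restored.append(record)

print(restored[0]["Genres"])
```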

#### 6. Scraping the data and combining the dataframes

##### 6.a Write a function that scrapes data (mentioned in Q5) from page number 1, 2, 3, 4 and 5 on the URL https://www.themoviedb.org/movie and returns 5 data frames which can be exported to csv file by calling the functions defined in Q3a, Q4c and Q5

In [72]:
def create_movies_dataframes_all_pages(movies_page_base_url, page_start=1, page_count=5, debug=True):
    """
    Create a list of DataFrames, one per page of movies data, for page_count pages starting from page_start
    Args:
        movies_page_base_url (str): The base URL for the movies page.
        page_start (int, optional): The starting page number. Defaults to 1.
        page_count (int, optional): The number of pages to fetch. Defaults to 5.
        debug (bool, optional): If True, print debug messages. Defaults to True.
    Returns: List of DataFrames for each page.
    """
    df_list = []
    # The end of range is exclusive, so this covers exactly page_count pages from page_start
    for i in range(page_start, page_start + page_count):
        movies_page_url = f"{movies_page_base_url}?page={i}"
        if debug: print(f'Generating dataframe for {movies_page_url}')
        df_movie = create_movies_dataframe(movies_page_url, headers=None, debug=debug)
        df_list.append(df_movie)
    return df_list

In [73]:
page_count = 5
print(f"Scraping and creating data frames for base URL: {URL} for {page_count} pages")
df_list = create_movies_dataframes_all_pages(URL, 1, page_count=page_count, debug=False)
print(f"{len(df_list)} Data Frames acquired")

# enumerate numbers the files from 1 without a manual counter
for index, df in enumerate(df_list, start=1):
    print(f"Writing Data Frame to file DataFrame_{index}.csv")
    df.to_csv(f'DataFrame_{index}.csv', index=False)
    
print("Execution Completed...")

Scraping and creating data frames for base URL: https://www.themoviedb.org/movie for 5 pages
5 Data Frames acquired
Writing Data Frame to file DataFrame_1.csv
Writing Data Frame to file DataFrame_2.csv
Writing Data Frame to file DataFrame_3.csv
Writing Data Frame to file DataFrame_4.csv
Writing Data Frame to file DataFrame_5.csv
Execution Completed...
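
The page URLs used for this scrape follow the pattern `?page=N` appended to the base URL. A minimal sketch of generating them so that exactly `page_count` URLs are produced for any `page_start` is shown below. `build_page_urls` is a hypothetical helper, not a function from this notebook.

```python
from urllib.parse import urlencode

def build_page_urls(base_url, page_start=1, page_count=5):
    """Return page_count listing URLs, beginning at page_start."""
    return [
        f"{base_url}?{urlencode({'page': page})}"
        for page in range(page_start, page_start + page_count)
    ]

for url in build_page_urls("https://www.themoviedb.org/movie", page_start=2, page_count=3):
    print(url)
```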


##### 6.b. Combine the data obtained from dataframes in Q6(a)

In [75]:
def combine_all_dataframes(df_list):
    """
    This function combines all DataFrames in a list into one DataFrame.
    Args:
        df_list (list of DataFrame): List of DataFrames to combine
    Returns: A combined DataFrame built with the concat method
    """
    # Concatenate all data frames in the list into a single data frame.
    # ignore_index=True already renumbers the rows from 0, so no reset_index call is needed.
    combined_df = pd.concat(df_list, ignore_index=True)
    return combined_df

In [76]:
combined_df = combine_all_dataframes(df_list)
pprint(combined_df)

                                                Title User Rating %  \
0                                Deadpool & Wolverine            78   
1                                            Twisters            71   
2                                     Despicable Me 4            73   
3                                        Inside Out 2            76   
4                               Bad Boys: Ride or Die            76   
..                                                ...           ...   
95                Spider-Man: Across the Spider-Verse            84   
96  The Chronicles of Narnia: The Lion, the Witch ...            71   
97                                        Black Noise            53   
98                            Spider-Man: No Way Home            80   
99                                    Double Jeopardy            66   

                                             Genres  \
0                 [Action, Comedy, Science Fiction]   
1              [Action, Adventure, Dr

In [77]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Title          100 non-null    object
 1   User Rating %  100 non-null    object
 2   Genres         100 non-null    object
 3   Cast           100 non-null    object
dtypes: object(4)
memory usage: 3.3+ KB


In [78]:
combined_df.describe()

                       Title User Rating %              Genres                                               Cast
count                    100           100                 100                                                100
unique                   100            37                  76                                                100
top     Deadpool & Wolverine            70  [Horror, Thriller]  [Ryan Reynolds, as Wade Wilson / Deadpool / Ni...
freq                       1             8                   5                                                  1


In [79]:
combined_df.to_csv('TMDB_DataFrame_Combined.csv', index=False)