### Objective:

<p>
The goal is to extract data (from a chosen number of pages) from The Movie Database website
(https://www.themoviedb.org/) into a tabular data format so that further analysis (e.g., details about a
movie's genre, cast, and user rating) can be facilitated.
</p>

In [5]:
! pip install requests



In [6]:
# pprint (pretty printing) is part of the Python standard library, so nothing needs to be installed
from pprint import pprint

### 1. Establish a connection to webpage [TheMovieDb](https://www.themoviedb.org/movie)

In [7]:
# Define some constants; by convention, constant names are written in all caps
# URL to scrape
URL = 'https://www.themoviedb.org/movie'

# The User-Agent header avoids a 403 Forbidden error
# The Accept-Language header requests the response in English;
# otherwise, a regional response (title, description) would be returned by default
NEEDED_HEADERS = {
    'User-Agent':"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    'Accept-Language': "en-US,en;q=0.9"
}

##### Import the requests library and send a GET request to download the contents of the webpage (https://www.themoviedb.org/movie)

In [8]:
import requests

response = requests.get(url=URL, headers=NEEDED_HEADERS)
response

<Response [200]>

##### Verify the status code of the request and confirm that the request was executed appropriately

In [9]:
# Check the status code explicitly
if response.status_code == 200:
    print(f"Request was successful with status code [{response.status_code}] to [{URL}]")
else:
    print(f"Request failed with status code [{response.status_code}] to [{URL}]")

Request was successful with status code [200] to [https://www.themoviedb.org/movie]
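A `requests.Response` object is truthy whenever its status code is below 400, so a bare `if response:` test reports on the status rather than on whether a response arrived. The following is a minimal offline sketch of that behaviour; the responses are built by hand purely for illustration, and `describe` is a hypothetical helper, not part of the notebook.

```python
import requests


def describe(response):
    """Summarise a response using its truthiness and status code."""
    if response:  # same as response.ok: True when status_code is below 400
        return f"ok ({response.status_code})"
    return f"failed ({response.status_code})"


# Hand-built responses for illustration only; real ones come from requests.get()
found = requests.Response()
found.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(describe(found))    # ok (200)
print(describe(missing))  # failed (404)
```

When any error status should stop the notebook, `response.raise_for_status()` is an alternative that raises `requests.exceptions.HTTPError` for 4xx and 5xx codes.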


In [10]:
print(f"Type of response variable is [{type(response)}]")
print("Printing the first 500 bytes of the content using slicing")
pprint(response.content[:500])

Type of response variable is [<class 'requests.models.Response'>]
Printing the first 500 bytes of the content using slicing
(b'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popul'
 b'ar Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv'
 b'="cleartype" content="on">\n    <meta charset="utf-8">\n    <meta name="ke'
 b'ywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresse'
 b's, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n  '
 b'  <meta name="mobile-web-app-capable" content="yes">\n    <meta name="app'
 b'le-mobile-web-app-capable" content="yes">\n    <meta name="viewpo')


#### 2. Parse the content of HTML response using the BeautifulSoup library and execute the tasks

In [14]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting beautifulsoup4 (from bs4)
  Downloading beautifulsoup4-4.13.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4->bs4)
  Downloading soupsieve-2.6-py3-none-any.whl.metadata (4.6 kB)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Downloading beautifulsoup4-4.13.3-py3-none-any.whl (186 kB)
Downloading soupsieve-2.6-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.13.3 bs4-0.0.2 soupsieve-2.6


Note: `!pip install BeautifulSoup` does not work here. The PyPI package named `BeautifulSoup` is the legacy 3.x release (version 3.2.2 was downloaded), and its build fails with `error: subprocess-exited-with-error` under Python 3.13. The `bs4` package installed above pulls in `beautifulsoup4`, which is the package to use.

In [15]:
from bs4 import BeautifulSoup

# Create the soup object using the built-in html.parser
soup = BeautifulSoup(response.content, 'html.parser')


##### Extract the title of the parsed web page content

In [16]:
soup.title.string

'Popular Movies — The Movie Database (TMDB)'

##### Write a user defined function to generalize the task

In [17]:
def get_soup_object(url, headers=None):
    """
    This function takes a URL as input and returns the BeautifulSoup object of the webpage.
    It is assumed that the BeautifulSoup and requests libraries are already imported by the caller.
    The function makes use of the html.parser parser.
    
    Args:
    url (str): The URL of the web page
    headers (dict of str to str): The headers to be used in the request. Default is None.
    If no headers are passed, default User-Agent and Accept-Language headers are used.

    Returns:
    soup (BeautifulSoup object): The BeautifulSoup object of the webpage

    Raises:
    Exception: If the request fails or the status code is not 200

    Example:
    >>> get_soup_object('https://www.test.com')
    """

    # Set the headers conditionally, if no headers are passed
    # These headers are local to the get_soup_object function 
    if headers is None:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9"
        }

    try:
        # Send a GET request to the URL with User-Agent and Accept-Language headers
        res = requests.get(url=url, headers=headers)

        if res.status_code != 200:
            raise Exception(
                f"Request failed for [{url}] with StatusCode:[{res.status_code}] and Reason:[{res.reason}]"
            )
        return BeautifulSoup(res.content, 'html.parser')
    except requests.exceptions.RequestException as e:
        # Chain the original exception so the underlying cause stays visible
        raise Exception(f"Request error occurred for [{url}] with error message [{e}]") from e

#### 3. Extract the content of the webpage that hosts a current dated listing of popular movies

In [18]:
try:
    # URL and NEEDED_HEADERS were defined at the start
    soup_tmdb = get_soup_object(url=URL, headers=NEEDED_HEADERS)
    pprint(soup_tmdb.prettify()[:1000])
except Exception as e:
    pprint(f"Error occurred: {e}")

('<!DOCTYPE html>\n'
 '<html class="no-js" lang="en">\n'
 ' <head>\n'
 '  <title>\n'
 '   Popular Movies — The Movie Database (TMDB)\n'
 '  </title>\n'
 '  <meta content="on" http-equiv="cleartype"/>\n'
 '  <meta charset="utf-8"/>\n'
 '  <meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, '
 'Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" '
 'name="keywords"/>\n'
 '  <meta content="yes" name="mobile-web-app-capable"/>\n'
 '  <meta content="yes" name="apple-mobile-web-app-capable"/>\n'
 '  <meta content="width=device-width,initial-scale=1" name="viewport"/>\n'
 '  <meta content="The Movie Database (TMDB) is a popular, user editable '
 'database for movies and TV shows." name="description"/>\n'
 '  <meta '
 'content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" '
 'name="msapplication-TileImage"/>\n'
 '  <meta content="#032541" name="msapplication-TileColor"/>\n'
 '  <meta content="#032

##### Print the HTML content associated with the first movie displayed on the web page

By inspecting the HTML, we can see that all the movie details are enclosed in a wrapping div with the CSS class page_wrapper. Each movie's details are then enclosed in an inner div whose class attribute is exactly "card style_1". Passing that full string to the find method returns the first occurrence of such a div, which contains all the HTML for the first listed movie.

In [19]:
if soup_tmdb:
    first_movie_html = soup_tmdb.find('div', class_='card style_1')
    pprint(first_movie_html.prettify()[0:1000])
else:
    print('HTML source for movie details not found')

('<div class="card style_1">\n'
 ' <div class="image">\n'
 '  <div class="wrapper glyphicons_v2 picture grey no_image_holder">\n'
 '   <a class="image" href="/movie/762509-mufasa-the-lion-king" title="Mufasa: '
 'The Lion King">\n'
 '    <img alt="Mufasa: The Lion King" class="poster w-full" loading="lazy" '
 'src="https://media.themoviedb.org/t/p/w220_and_h330_face/9bXHaLlsFYpJUutg4E6WXAjaxDi.jpg" '
 'srcset="https://media.themoviedb.org/t/p/w220_and_h330_face/9bXHaLlsFYpJUutg4E6WXAjaxDi.jpg '
 '1x, '
 'https://media.themoviedb.org/t/p/w440_and_h660_face/9bXHaLlsFYpJUutg4E6WXAjaxDi.jpg '
 '2x"/>\n'
 '   </a>\n'
 '  </div>\n'
 '  <div class="options" data-id="762509" data-media-type="movie" '
 'data-object-id="5fa9c3759ac5350041b9b405">\n'
 '   <a aria-label="View Item Options" class="no_click" href="#">\n'
 '    <div class="glyphicons_v2 circle-more white">\n'
 '    </div>\n'
 '   </a>\n'
 '  </div>\n'
 ' </div>\n'
 ' <div class="content">\n'
 '  <div class="consensus tight">\n'
 '   
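Passing a string that contains a space to `class_` makes BeautifulSoup compare it against the whole class attribute, so the order of the classes matters. The sketch below uses a small made-up HTML snippet (not TMDB markup) to contrast that with a CSS selector, which matches any element carrying both classes.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page has far more structure
html = """
<div class="card style_1"><h2>First</h2></div>
<div class="style_1 card"><h2>Second</h2></div>
<div class="card style_2"><h2>Third</h2></div>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# Exact-string match: only the div whose class attribute is literally "card style_1"
exact = sample_soup.find_all("div", class_="card style_1")

# CSS selector: any div that has both classes, in any order
both = sample_soup.select("div.card.style_1")

print([d.h2.get_text() for d in exact])  # ['First']
print([d.h2.get_text() for d in both])   # ['First', 'Second']
```

The exact-string form works for this page because every card is written as `card style_1`; the selector form is the more tolerant choice if the site ever reorders or adds classes.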

##### Display the name of the first movie

In [20]:
first_movie = soup_tmdb.find('div', class_='card style_1') if soup_tmdb else None
title_tag = first_movie.find('h2') if first_movie else None
if title_tag:
    print('Movie Title =', title_tag.get_text(strip=True))
else:
    # Reached when the page, the card, or the heading is missing
    print('Movie title not found')

Movie Title = Mufasa: The Lion King


##### Display the user rating (Score Chart) of the first movie

In [21]:
if soup_tmdb:
    first_movie = soup_tmdb.find('div', class_='card style_1')
    score_chart = first_movie.find('div', class_='user_score_chart') if first_movie else None
    # Only append the percent sign when a score is actually present
    if score_chart and score_chart.get('data-percent'):
        print('User Score Percentage =', f"{score_chart['data-percent']}%")
    else:
        print('not rated')

User Score Percentage = 75%


##### Extract the relative URL of the movie

The href of the movie link is already a relative URL, so it can be read directly from the tag. We use string slicing to remove the leading / from the relative URL.

```
<a class="image" href="/movie/533535-deadpool-wolverine" title="Deadpool &amp; Wolverine" /> 
```

In [22]:
if soup_tmdb:
    first_movie = soup_tmdb.find('div', class_=['card style_1'])
    if first_movie:
        relative_url = first_movie.find('a', class_='image')['href'][1:]
        print(relative_url)
    else:
        print('Relative URL for movie not found')

movie/762509-mufasa-the-lion-king
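As an alternative to slicing off the leading slash and concatenating later, the standard library's `urllib.parse.urljoin` can build the absolute URL directly. This is a small sketch; `BASE_URL` and `to_absolute` are names chosen here for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.themoviedb.org/"  # site root, assumed here as the base


def to_absolute(href, base=BASE_URL):
    """Resolve a relative href against the site root."""
    return urljoin(base, href)


# Both forms of the relative URL resolve to the same absolute address
print(to_absolute("/movie/762509-mufasa-the-lion-king"))
print(to_absolute("movie/762509-mufasa-the-lion-king"))
# https://www.themoviedb.org/movie/762509-mufasa-the-lion-king (printed twice)
```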


#### Define functions for extracting information such as
* Titles
* Ratings
* Genres
* Cast

##### Titles of all the movies on the page as a Python list

In [23]:
def get_movie_titles(obj_soup):
    """
    Extracts movie titles from a TMDB BeautifulSoup object.
    :param obj_soup: A BeautifulSoup object of the page
    :return: A list of movie titles
    """
    # Get all movie cards
    movies = obj_soup.find_all('div', class_='card style_1')

    # Each card holds its title in an h2 tag
    return [movie.find('h2').get_text(strip=True) for movie in movies]

##### We have our soup object in soup_tmdb variable which we will use here

In [24]:
titles = get_movie_titles(soup_tmdb)
pprint(titles[:10])

['Mufasa: The Lion King',
 'Moana 2',
 'Sonic the Hedgehog 3',
 'Amaran',
 'Captain America: Brave New World',
 'The Gorge',
 'Panda Plan',
 'Flight Risk',
 'My Fault: London',
 'Kraven the Hunter']


##### User ratings of all the movies

In [25]:
def get_user_ratings(obj_soup):
    """
    Extracts user ratings from a given soup object of TMDB movie cards.
    :param obj_soup: The soup object of the TMDB movies page
    :return: A list of user ratings as strings ('not rated' when a score is missing)
    """
    ratings = []
    #get all movie cards
    movies = obj_soup.find_all('div', class_=['card style_1'])

    for movie in movies:
        score_chart = movie.find('div', class_='user_score_chart')
        if score_chart:
            score = score_chart.get('data-percent', 'not rated')
        else:
            score = 'not rated'
        ratings.append(score)
    return ratings

In [26]:
user_ratings = get_user_ratings(soup_tmdb)
pprint(user_ratings[:10])

['75', '72', '78', '76', '62', '78', '72', '58', '75', '67']


##### HTML content of all the individual pages

In [27]:
def get_movies_html(obj_soup):
    """
    Extracts the HTML of every movie card from a BeautifulSoup object of the TMDB movies page.
    :param obj_soup: The soup object representation of the TMDB movies page
    :return: A list of BeautifulSoup Tag objects, one per movie card
    """
    # Get all movie cards
    movies = obj_soup.find_all('div', class_='card style_1')
    return list(movies)

def get_movies_details_url(obj_soup):
    """
    Extracts the URLs of the movie details pages from a BeautifulSoup object of the TMDB movies page.
    :param obj_soup: The soup object representation of the TMDB movies page
    :return: A list of relative URLs (without the leading slash) of the movie details pages
    """
    # Get all movie cards
    movies = obj_soup.find_all('div', class_='card style_1')
    # Drop the leading '/' from each href with slicing
    return [movie.find('a', class_='image')['href'][1:] for movie in movies]

In [28]:
movies_html = get_movies_html(soup_tmdb)
pprint(movies_html[0])

<div class="card style_1">
<div class="image">
<div class="wrapper glyphicons_v2 picture grey no_image_holder">
<a class="image" href="/movie/762509-mufasa-the-lion-king" title="Mufasa: The Lion King">
<img alt="Mufasa: The Lion King" class="poster w-full" loading="lazy" src="https://media.themoviedb.org/t/p/w220_and_h330_face/9bXHaLlsFYpJUutg4E6WXAjaxDi.jpg" srcset="https://media.themoviedb.org/t/p/w220_and_h330_face/9bXHaLlsFYpJUutg4E6WXAjaxDi.jpg 1x, https://media.themoviedb.org/t/p/w440_and_h660_face/9bXHaLlsFYpJUutg4E6WXAjaxDi.jpg 2x"/>
</a>
</div>
<div class="options" data-id="762509" data-media-type="movie" data-object-id="5fa9c3759ac5350041b9b405">
<a aria-label="View Item Options" class="no_click" href="#"><div class="glyphicons_v2 circle-more white"></div></a>
</div>
</div>
<div class="content">
<div class="consensus tight">
<div class="outer_ring">
<div class="user_score_chart 5fa9c3759ac5350041b9b405" data-bar-color="#21d07a" data-percent="75" data-track-color="#204529">
<d

In [29]:
movies_url_list = get_movies_details_url(soup_tmdb)
pprint(movies_url_list[:10])

['movie/762509-mufasa-the-lion-king',
 'movie/1241982-moana-2',
 'movie/939243-sonic-the-hedgehog-3',
 'movie/927342',
 'movie/822119-captain-america-brave-new-world',
 'movie/950396-the-gorge',
 'movie/1160956',
 'movie/1126166-flight-risk',
 'movie/1294203-my-fault-london',
 'movie/539972-kraven-the-hunter']


#### Genres of all the movies

<p>
    Each movie has a list of genres. Hence, we pass the list of relative URLs of the movies to the function, and it returns a list of lists containing the genres of the movie at each URL.

    Input: List of relative URLs of the movie details pages
    Output: List of genres for each movie in the input list
</p>

In [30]:
def get_movies_genere(movie_details_url, debug=False):
    """
    This function takes a list of movie details URLs and returns the genres of each movie.
    Args:
    movie_details_url (list): The relative URLs of all the movie details pages.
    debug (bool): If True, print each URL as it is fetched. Default is False
    Returns: A list with one list of genres for each URL in the input list
    """
    movie_genere = []
    for rel_url in movie_details_url:
        # Construct the detailed movie URL
        details_url = f"https://www.themoviedb.org/{rel_url}"
        if debug: print(f"Fetching genres for [{details_url}]")
        # Get the soup object for the detailed page
        soup = get_soup_object(details_url)
        # Default to an empty list so that every URL yields exactly one entry
        g = []
        generes = soup.find('span', class_='genres') if soup else None
        if generes:
            g = [a.get_text(strip=True) for a in generes.find_all('a')]
        movie_genere.append(g)
    return movie_genere

In [31]:
movie_genere = get_movies_genere(get_movies_details_url(soup_tmdb))
pprint(movie_genere[:10])

[['Adventure', 'Family', 'Animation'],
 ['Animation', 'Adventure', 'Family', 'Comedy'],
 ['Action', 'Science Fiction', 'Comedy', 'Family'],
 ['Action', 'Drama', 'Adventure', 'War'],
 ['Action', 'Thriller', 'Science Fiction'],
 ['Romance', 'Science Fiction', 'Thriller'],
 ['Action', 'Comedy'],
 ['Action', 'Thriller', 'Crime'],
 ['Romance', 'Drama'],
 ['Action', 'Adventure', 'Thriller']]


#### Cast of all the movies

<p>
    The cast of each movie has two details, actor and character. Hence, we pass a list of relative movie URLs to the function, and it returns, for each movie, a list of strings in the format "Actor, as Character".

    Input: List of relative URLs of the movie details pages
    Output: List containing, for each movie in the input list, a list of its cast members
</p>

In [32]:
def get_movies_cast(movie_details_url, debug=False):
    """
    Get the cast of all the movies from the movie details URL list.
    Args:
    movie_details_url (list): List of relative URLs of movie details pages.
    debug (bool): If True, print debug messages. Default is False.
    Returns: For each movie, a list of cast members in "Actor, as Character" format
    """
    movies_actor_list = []
    for rel_url in movie_details_url:
        # Construct the detailed movie URL
        details_url = f"https://www.themoviedb.org/{rel_url}"
        if debug: print(f"Fetching cast for [{details_url}]")
        # Get the soup object for the detailed page
        soup = get_soup_object(details_url)
        if soup:
            casts = soup.find_all('li', class_='card')
            actor_list = []
            for cst in casts:
                name_tag = cst.find('p')
                actor_link = name_tag.find('a') if name_tag else None
                if not actor_link:
                    continue  # skip cards that do not carry an actor link
                actor = actor_link.get_text(strip=True)
                character_tag = cst.find('p', class_='character')
                character = character_tag.get_text(strip=True) if character_tag else ''
                actor_list.append(f"{actor}, as {character}" if character else f"{actor},")
            movies_actor_list.append(actor_list)
    return movies_actor_list

In [33]:
movie_actor_list = get_movies_cast(get_movies_details_url(soup_tmdb)) 
pprint(movie_actor_list[:5])

[['Aaron Pierre, as Mufasa (voice)',
  'Kelvin Harrison Jr., as Taka (voice)',
  'Tiffany Boone, as Sarabi (voice)',
  'Kagiso Lediga, as Young Rafiki (voice)',
  'Preston Nyman, as Zazu (voice)',
  'Blue Ivy Carter, as Kiara (voice)',
  'John Kani, as Rafiki (voice)',
  'Mads Mikkelsen, as Kiros (voice)',
  'Seth Rogen, as Pumbaa (voice)'],
 ['Auliʻi Cravalho, as Moana (voice)',
  'Dwayne Johnson, as Maui (voice)',
  'Hualālai Chung, as Moni (voice)',
  'Rose Matafeo, as Loto (voice)',
  'David Fane, as Kele (voice)',
  'Awhimai Fraser, as Matangi (voice)',
  'Khaleesi Lambert-Tsuda, as Simea (voice)',
  'Temuera Morrison, as Chief Tui (voice)',
  'Nicole Scherzinger, as Sina (voice)'],
 ['Jim Carrey, as Ivo Robotnik / Gerald Robotnik',
  'Ben Schwartz, as Sonic (voice)',
  'Keanu Reeves, as Shadow (voice)',
  'Idris Elba, as Knuckles (voice)',
  "Colleen O'Shaughnessey, as Tails (voice)",
  'James Marsden, as Tom',
  'Tika Sumpter, as Maddie',
  'Lee Majdoub, as Agent Stone',
  'Krys
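Because each cast entry is a single "Actor, as Character" string, later analysis may be easier if the two parts are separated. The sketch below splits one entry into a dictionary; `parse_cast_entry` is a hypothetical helper, and "Jane Doe" is a made-up name used only to show the case with no character.

```python
def parse_cast_entry(entry):
    """Split 'Actor, as Character' into a dict; Character is None when absent."""
    actor, separator, character = entry.partition(", as ")
    if not separator:
        # Entries without a character are written as 'Actor,' by get_movies_cast
        return {"Actor": entry.rstrip(","), "Character": None}
    return {"Actor": actor, "Character": character}


print(parse_cast_entry("Aaron Pierre, as Mufasa (voice)"))
# {'Actor': 'Aaron Pierre', 'Character': 'Mufasa (voice)'}
print(parse_cast_entry("Jane Doe,"))
# {'Actor': 'Jane Doe', 'Character': None}
```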

#### We build a DataFrame with the following data:

* Titles of the movies listed on the page
* User ratings of the movies listed on the page
* Genres of the movies listed on the page
* Cast of the movies listed on the page

In [34]:
import pandas as pd

def create_movies_dataframe(movies_page_url, headers=None, debug=False):
    """
    Creates a pandas DataFrame from the movies listing page at a given URL.
    The function utilizes the existing functions for fetching titles, ratings, genres and casts
    and returns a DataFrame with the fetched data.
    Args:
        movies_page_url (str): The URL of the movies listing page.
        headers (dict): A dictionary of headers to be used in the request. Defaults to None.
        debug (bool): A flag to enable debug mode. Defaults to False.
    Returns: A DataFrame with Title, User Rating %, Genres and Cast for all the movies on the page
    """
    # First, get the soup object for the listing page
    soup = get_soup_object(movies_page_url, headers)
    if debug: pprint(soup)
    # Second, get the titles
    titles = get_movie_titles(soup)
    if debug: pprint(titles)
    
    # Third, get the user ratings
    user_ratings = get_user_ratings(soup)
    if debug: pprint(user_ratings)
    # Fourth, get the relative URLs of the details pages
    details_urls = get_movies_details_url(soup)
    if debug: pprint(details_urls)
    # Fifth, get the genres (one request per details page)
    genres = get_movies_genere(details_urls, debug=debug)
    if debug: pprint(genres)
    # Sixth, get the casts (another request per details page)
    actors = get_movies_cast(details_urls, debug=debug)
    if debug: pprint(actors)
    
    # Create a DataFrame from the lists
    df = pd.DataFrame({
        'Title': titles,
        'User Rating %': user_ratings,
        'Genres': genres,
        'Cast': actors
    })
    
    return df

In [35]:
df = create_movies_dataframe(URL)
pprint(df.head(5))

                              Title User Rating %  \
0             Mufasa: The Lion King            75   
1                           Moana 2            72   
2              Sonic the Hedgehog 3            78   
3                            Amaran            76   
4  Captain America: Brave New World            62   

                                      Genres  \
0             [Adventure, Family, Animation]   
1     [Animation, Adventure, Family, Comedy]   
2  [Action, Science Fiction, Comedy, Family]   
3            [Action, Drama, Adventure, War]   
4        [Action, Thriller, Science Fiction]   

                                                Cast  
0  [Aaron Pierre, as Mufasa (voice), Kelvin Harri...  
1  [Auliʻi Cravalho, as Moana (voice), Dwayne Joh...  
2  [Jim Carrey, as Ivo Robotnik / Gerald Robotnik...  
3  [Sivakarthikeyan, as Major Mukund Varadarajan,...  
4  [Anthony Mackie, as Sam Wilson / Captain Ameri...  
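Both `get_movies_genere` and `get_movies_cast` call `get_soup_object` for every details URL, so building one DataFrame requests each details page twice. One way to avoid the repeat is to cache the fetch. The sketch below shows the idea with `functools.lru_cache` and a stand-in `download` function that makes no network calls; both function names are hypothetical.

```python
from functools import lru_cache

request_log = []  # records every simulated download


def download(url):
    """Stand-in for a real HTTP request (hypothetical; no network access)."""
    request_log.append(url)
    return f"<html>page for {url}</html>"


@lru_cache(maxsize=None)
def download_once(url):
    """Return the page for url, downloading it only on the first call."""
    return download(url)


urls = ["movie/1", "movie/2"]
genre_pages = [download_once(u) for u in urls]  # first pass: two downloads
cast_pages = [download_once(u) for u in urls]   # second pass: served from the cache

print(len(request_log))  # 2
```

In the notebook, the same decorator could wrap a one-argument function that calls `get_soup_object`, since `lru_cache` needs hashable arguments and a headers dictionary is not hashable.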


#### 6. Combining the dataframes

In [36]:
def create_movies_dataframes_all_pages(movies_page_base_url, page_start=1, page_count=5, debug=False):
    """
    Create a list of DataFrames, one per page of movies data, for page_count pages starting from page_start
    Args:
        movies_page_base_url (str): The base URL for the movies page.
        page_start (int, optional): The starting page number. Defaults to 1.
        page_count (int, optional): The number of pages to fetch. Defaults to 5.
        debug (bool, optional): If True, print debug messages. Defaults to False.
    Returns: List of DataFrames, one for each page.
    """
    df_list = []
    # Fetch exactly page_count pages, whatever the starting page is
    for i in range(page_start, page_start + page_count):
        movies_page_url = f"{movies_page_base_url}?page={i}"
        if debug: print(f'Generating dataframe for {movies_page_url}')
        # Per-page debug output stays off to keep the log readable
        df_movie = create_movies_dataframe(movies_page_url, headers=None, debug=False)
        df_list.append(df_movie)
    return df_list

In [37]:
page_count = 5
print(f"Scraping and creating data frames for base URL: {URL} for {page_count} pages")
df_list = create_movies_dataframes_all_pages(URL, 1, page_count=page_count, debug=True)
print(f"{len(df_list)} Data Frames acquired")

# Number the output files from 1
for index, df in enumerate(df_list, start=1):
    print(f"Writing Data Frame to file DataFrame_{index}.csv")
    df.to_csv(f'DataFrame_{index}.csv', index=False)
    
print("Execution Completed...")

Scraping and creating data frames for base URL: https://www.themoviedb.org/movie for 5 pages
Generating dataframe for https://www.themoviedb.org/movie?page=1
Generating dataframe for https://www.themoviedb.org/movie?page=2
Generating dataframe for https://www.themoviedb.org/movie?page=3
Generating dataframe for https://www.themoviedb.org/movie?page=4
Generating dataframe for https://www.themoviedb.org/movie?page=5
5 Data Frames acquired
Writing Data Frame to file DataFrame_1.csv
Writing Data Frame to file DataFrame_2.csv
Writing Data Frame to file DataFrame_3.csv
Writing Data Frame to file DataFrame_4.csv
Writing Data Frame to file DataFrame_5.csv
Execution Completed...
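The list of page URLs can be previewed without any network traffic, which is a cheap way to check the paging arithmetic before scraping. In this sketch, `build_page_urls` is a hypothetical helper; it treats `page_count` as the number of pages to visit, counted from `page_start`.

```python
def build_page_urls(base_url, page_start=1, page_count=5):
    """Return the listing URLs for page_count pages, beginning at page_start."""
    return [
        f"{base_url}?page={number}"
        for number in range(page_start, page_start + page_count)
    ]


for page_url in build_page_urls("https://www.themoviedb.org/movie", page_start=3, page_count=2):
    print(page_url)
# https://www.themoviedb.org/movie?page=3
# https://www.themoviedb.org/movie?page=4
```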


##### Combine the data obtained from dataframes

In [38]:
def combine_all_dataframes(df_list):
    """
    This function combines all dataframes in a list into one dataframe.
    Args:
        df_list (List of DataFrame): List of DataFrames to combine
    Returns: A combined DataFrame built with the concat method
    """
    # Concatenate all data frames in the list into a single data frame;
    # ignore_index=True already renumbers the rows from 0, so no reset_index is needed
    combined_df = pd.concat(df_list, ignore_index=True)
    return combined_df

In [39]:
combined_df = combine_all_dataframes(df_list)
pprint(combined_df)

                               Title User Rating %  \
0              Mufasa: The Lion King            75   
1                            Moana 2            72   
2               Sonic the Hedgehog 3            78   
3                             Amaran            76   
4   Captain America: Brave New World            62   
..                               ...           ...   
95      Sexy Oral: Uwakina Kuchibiru            60   
96                 Honeymoon Crasher            52   
97                           Titanic            79   
98                365 Days: This Day            59   
99                   The Dark Knight            85   

                                       Genres  \
0              [Adventure, Family, Animation]   
1      [Animation, Adventure, Family, Comedy]   
2   [Action, Science Fiction, Comedy, Family]   
3             [Action, Drama, Adventure, War]   
4         [Action, Thriller, Science Fiction]   
..                                        ...   
95      
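The effect of `ignore_index=True` in `pd.concat` is easiest to see on two tiny frames. The data below is made up for illustration.

```python
import pandas as pd

# Two small stand-ins for per-page DataFrames
page_1 = pd.DataFrame({"Title": ["A", "B"], "User Rating %": ["75", "72"]})
page_2 = pd.DataFrame({"Title": ["C"], "User Rating %": ["78"]})

kept = pd.concat([page_1, page_2])                            # original row labels are kept
renumbered = pd.concat([page_1, page_2], ignore_index=True)   # rows are renumbered from 0

print(list(kept.index))        # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```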

In [40]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Title          100 non-null    object
 1   User Rating %  100 non-null    object
 2   Genres         100 non-null    object
 3   Cast           100 non-null    object
dtypes: object(4)
memory usage: 3.3+ KB
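The `info()` output lists `User Rating %` as `object` because the scraped ratings are strings, and `get_user_ratings` can also return the text `'not rated'`. For numeric analysis, the column can be converted with `pd.to_numeric`. This is a minimal sketch on made-up values rather than the scraped data.

```python
import pandas as pd

# Made-up ratings, including the placeholder used for unrated movies
ratings = pd.DataFrame({"User Rating %": ["75", "72", "not rated"]})

# errors="coerce" turns anything that is not a number into NaN
ratings["User Rating %"] = pd.to_numeric(ratings["User Rating %"], errors="coerce")

print(ratings["User Rating %"].dtype)   # float64
print(ratings["User Rating %"].mean())  # 73.5 (NaN is skipped)
```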


In [41]:
combined_df.describe()

                        Title User Rating %            Genres Cast
count                     100           100               100  100
unique                    100            35                71   99
top     Mufasa: The Lion King            71  [Romance, Drama]   []
freq                        1            11                 6    2


In [42]:
combined_df.to_csv('TMDB_DataFrme_Combined.csv', index=False)
print("Execution Completed, Collect the data from Output Folder")

Execution Completed, Collect the data from Output Folder
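The `Genres` and `Cast` columns hold Python lists, which `to_csv` writes as their string representation, so reading the file back yields strings rather than lists. The sketch below shows one way to restore them with `ast.literal_eval`, using an in-memory buffer and a made-up one-row frame instead of the real output file.

```python
import ast
import io

import pandas as pd

original = pd.DataFrame({
    "Title": ["Moana 2"],
    "Genres": [["Animation", "Adventure"]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(type(restored.loc[0, "Genres"]).__name__)  # str

# Parse the string representation back into a real list
restored["Genres"] = restored["Genres"].apply(ast.literal_eval)
print(restored.loc[0, "Genres"])  # ['Animation', 'Adventure']
```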
