This lab illustrates how to use `BeautifulSoup` to scrape the Internet Movie Database (IMDb) at [imdb.com](https://imdb.com) for the films released in 2023 with the highest US box office gross.

The final dataframe will contain the following columns:

* `name` - title of the movie
* `year` - release year of the movie
* `rating` - IMDb rating of the movie
* `m_score` - meta score of the movie
* `votes` - number of votes

First, we import the required packages:

In [1]:
import bs4
import requests
import time
import random as ran
import sys
import pandas as pd

Now, search the [top 1000 films released in 2023 at imdb.com](https://www.imdb.com/search/title/?release_date=2023&sort=boxoffice_gross_us,desc&start=1) and scrape the results from the first page.

In [2]:
url = 'https://www.imdb.com/search/title?release_date=2023&sort=boxoffice_gross_us,desc&start=1'

source = requests.get(url).text
soup = bs4.BeautifulSoup(source,'html.parser')
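The search URL is made of three query parameters: the release date, the sort order, and the index of the first result. A minimal sketch of assembling it with the standard library is shown here; the helper name `search_url` and the constant `BASE` are hypothetical and not part of the lab code.

```python
from urllib.parse import urlencode

# Assumed base path, copied from the URL used in this lab
BASE = "https://www.imdb.com/search/title"

def search_url(year, start=1):
    """Build the search URL for a release year and a starting result index."""
    params = {
        "release_date": year,
        "sort": "boxoffice_gross_us,desc",
        "start": start,
    }
    # safe="," keeps the comma in the sort value unescaped
    return BASE + "?" + urlencode(params, safe=",")

print(search_url(2023))
```

Building the URL from a dictionary makes it easy to change only the `start` value when moving to another page.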

The code above downloads the whole first page. The code below keeps only the blocks that hold movie information.

In [3]:
movie_blocks = soup.find_all('div',{'class':'lister-item-content'})

Before extracting information for all movies, first examine one of the extracted blocks to identify the elements that we need to scrape.

The elements of the first movie block are extracted below:

In [4]:
mname = movie_blocks[0].find('a').get_text() # Name of the movie

m_reyear = int(movie_blocks[0].find('span',{'class': 'lister-item-year'}).contents[0][1:-1]) # Release year

m_rating = float(movie_blocks[0].find('div',{'class':'inline-block ratings-imdb-rating'}).get('data-value')) #rating

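# Matching on 'metascore' alone finds the span whatever second class it carries
m_mscore_tag = movie_blocks[0].find('span',{'class':'metascore'})
# find() returns None when no tag matches, so guard before converting
m_mscore = float(m_mscore_tag.get_text().strip()) if m_mscore_tag is not None else None # meta score

m_votes = int(movie_blocks[0].find('span',{'name':'nv'}).get('data-value')) # votes

print("Movie Name: " + mname,
      "\nRelease Year: " + str(m_reyear),
      "\nIMDb Rating: " + str(m_rating),
      "\nMeta score: " + str(m_mscore),
      "\nVotes: " + '{:,}'.format(m_votes)
)

As originally written, the meta score line passed the result of `find` for the class `metascore favorable` straight to `float()`, and the cell failed with `TypeError: float() argument must be a string or a real number, not 'NoneType'` because no matching tag was found. The guard above yields `None` in that case instead.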
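One way to make the meta score lookup more tolerant is to match on the `metascore` class alone and to check for `None` before converting. The sketch below runs against a hand-written HTML snippet (an assumption for illustration, not real IMDb markup), and the helper name `meta_score` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample: one block with a score, one block without
SAMPLE = """
<div class="lister-item-content">
  <span class="metascore mixed">55</span>
</div>
<div class="lister-item-content">
  <a href="#">No score here</a>
</div>
"""

def meta_score(block):
    """Return the meta score as a float, or None when the span is absent."""
    # A single class name matches tags that carry additional classes too
    tag = block.find('span', class_='metascore')
    if tag is None:
        return None
    return float(tag.get_text(strip=True))

blocks = BeautifulSoup(SAMPLE, 'html.parser').find_all('div', class_='lister-item-content')
scores = [meta_score(b) for b in blocks]
print(scores)  # [55.0, None]
```

Searching for the exact string `metascore favorable` would return `None` for the first sample block, which is the kind of situation that leads to the `TypeError` when the result is passed to `float()`.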

If you examine the result pages of the IMDb search above, you can see that changing the `start` value at the end of the URL moves through the search results. We will use this during the scrape to iterate through all pages.
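To make the paging idea concrete, the following sketch lists the `start` values needed to cover a given number of results. It assumes 50 results per page, which is inferred from the "101 - 150" progress message in this lab rather than from any documented IMDb behaviour; the function name `page_starts` is hypothetical.

```python
def page_starts(target, page_size=50):
    """Return the `start` values needed to cover `target` results."""
    # Results are numbered from 1, so pages begin at 1, 1 + page_size, ...
    return list(range(1, target + 1, page_size))

print(page_starts(150))  # [1, 51, 101]
```

Appending each of these values to a URL that ends in `start=` gives the address of one result page.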

Since scraping the data is an iterative process, we define a separate function for each step.

First, we define a function that extracts the targeted elements from a single movie block (discussed above).

In [5]:
def scrape_mblock(movie_block):
    
    movieb_data ={}
    # Lookup or conversion failures that mean "this field is missing"
    missing = (AttributeError, IndexError, TypeError, ValueError)
  
    try:
        movieb_data['name'] = movie_block.find('a').get_text() # Name of the movie
    except missing:
        movieb_data['name'] = None

    try:    
        movieb_data['year'] = str(movie_block.find('span',{'class': 'lister-item-year'}).contents[0][1:-1]) # Release year
    except missing:
        movieb_data['year'] = None

    try:
        movieb_data['rating'] = float(movie_block.find('div',{'class':'inline-block ratings-imdb-rating'}).get('data-value')) #rating
    except missing:
        movieb_data['rating'] = None

    try:
        # Matching on 'metascore' alone finds the span whatever second class it carries
        movieb_data['m_score'] = float(movie_block.find('span',{'class':'metascore'}).contents[0].strip()) #meta score
    except missing:
        movieb_data['m_score'] = None

    try:
        movieb_data['votes'] = int(movie_block.find('span',{'name':'nv'}).get('data-value')) # votes
    except missing:
        movieb_data['votes'] = None

    return movieb_data
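The same "try the lookup, fall back to `None`" pattern repeats for every field, so it could be factored into a small helper. This is only a sketch of that idea; the name `safe` and the sample dictionary are hypothetical and stand in for the real tag lookups.

```python
def safe(extract, default=None):
    """Run `extract()` and return `default` if the lookup or conversion fails."""
    try:
        return extract()
    except (AttributeError, IndexError, TypeError, ValueError):
        return default

# Stand-in for scraped values: one present, one missing
sample = {"votes": "12345", "rating": None}

votes = safe(lambda: int(sample["votes"]))      # conversion succeeds
rating = safe(lambda: float(sample["rating"]))  # float(None) raises TypeError
print(votes, rating)  # 12345 None
```

Listing the expected exception types, instead of catching everything, keeps unrelated problems such as a `KeyboardInterrupt` from being silently swallowed.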
    

Then we create the function below to scrape every movie block on a single search result page.

In [6]:
def scrape_m_page(movie_blocks):
    
    # Apply scrape_mblock to every block on the page
    page_movie_data = [scrape_mblock(block) for block in movie_blocks]
    
    return page_movie_data

We have now built the functions that extract all movie data from a single page.

The next function applies them to each page of the search results until we have scraped data for the targeted number of movies.

In [7]:
def scrape_this(link,t_count):

    base_url = link
    target = t_count
    
    current_mcount_start = 0
    current_mcount_end = 0
    remaining_mcount = target - current_mcount_end 
    
    # Value appended to the URL: the index of the first result on the next page
    next_start = 1
    
    movie_data = []
    
    while remaining_mcount > 0:

        url = base_url + str(next_start)
        
        source = requests.get(url).text
        soup = bs4.BeautifulSoup(source,'html.parser')
        
        movie_blocks = soup.find_all('div',{'class':'lister-item-content'})
        
        # Stop if a page has no movie blocks, otherwise the loop would never end
        if not movie_blocks:
            break
        
        movie_data.extend(scrape_m_page(movie_blocks))   
        
        # Read the "first-last" result range shown in the page navigation once
        range_text = soup.find("div", {"class":"nav"}).find("div", {"class": "desc"}).contents[1].get_text()
        current_mcount_start = int(range_text.split("-")[0])
        current_mcount_end = int(range_text.split("-")[1].split(" ")[0])

        remaining_mcount = target - current_mcount_end
        
        print('\r' + "currently scraping movies from: " + str(current_mcount_start) + " - "+str(current_mcount_end), "| remaining count: " + str(remaining_mcount), flush=True, end ="")
        
        next_start = current_mcount_end + 1
        
        # Pause for a random 0-10 seconds between requests
        time.sleep(ran.randint(0, 10))
    
    return movie_data
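The loop relies on reading the "first-last" range from the navigation text. A slightly more defensive way to parse such text is sketched below with a regular expression. The sample strings are made up for illustration (the real page text may differ), and `parse_range` is a hypothetical helper.

```python
import re

# Two numbers separated by a hyphen; digits may contain thousands separators
RANGE_RE = re.compile(r"(\d[\d,]*)\s*-\s*(\d[\d,]*)")

def parse_range(text):
    """Return (first, last) result numbers from navigation text, or None."""
    match = RANGE_RE.search(text)
    if match is None:
        return None
    first, last = (int(group.replace(",", "")) for group in match.groups())
    return first, last

print(parse_range("101-150 of 1,000 titles."))      # (101, 150)
print(parse_range("1,001-1,050 of 2,000 titles."))  # (1001, 1050)
print(parse_range("No results."))                   # None
```

Returning `None` for unexpected text gives the caller a clear signal to stop paging instead of raising inside the loop.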
    
    

Finally, we put together all functions created above to scrape the top 150 movies on the list

In [8]:
# Same 2023 search as above, with the start index left open for scrape_this to fill in
base_scraping_link = "https://www.imdb.com/search/title?release_date=2023&sort=boxoffice_gross_us,desc&start="

top_movies = 150 #input("How many movies do you want to scrape?")
films = []

films = scrape_this(base_scraping_link,int(top_movies))

print('\r'+"List of top " + str(top_movies) +" movies:" + "\n", end="\n")
pd.DataFrame(films)

currently scraping movies from: 101 - 150 | remaining count: 0


### Assignment: 

1. Create a web app using Dash and Plotly.
2. Scrape the content of your choice (for example: Top 250, top box office, or the results of your own query).
3. Visualize your results through multiple charts, as we did with the worldometers website.
4. Try to create your own charts based on the chosen content.

In [35]:
import dash
# Dash 2.x ships dcc and html inside the dash package;
# the separate dash_core_components / dash_html_components packages are deprecated
from dash import dcc, html
import pandas as pd
import requests
from bs4 import BeautifulSoup
import plotly.graph_objects as go

In [28]:
# Scraping Code
start_url = "https://www.imdb.com/chart/top"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}

response = requests.get(start_url, headers=header)
soup = BeautifulSoup(response.content, 'html.parser')
movie_containers = soup.select('td.titleColumn')[:100]

data = []

for rank, movie_container in enumerate(movie_containers, start=1):
    # Follow the link to the movie's own page for genre, director, rating and cast
    movie_url = movie_container.select_one('a[href^="/title/"]')
    full_movie_url = "https://www.imdb.com" + movie_url['href']
    response = requests.get(full_movie_url, headers=header)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Split only at the first period (the one after the rank) so titles that contain periods stay intact
    movie_name = movie_container.get_text(strip=True).split('.', 1)[1].strip()
    movie_year_elem = movie_container.select_one('span.secondaryInfo')
    movie_year = movie_year_elem.get_text(strip=True)[1:-1] if movie_year_elem else None
    genre_elems = soup.select('span.ipc-chip__text')
    genre = [g.get_text(strip=True) for g in genre_elems if g.get_text(strip=True) != "Back to top"]
    director_elem = soup.select_one('a[href^="/name/"]')
    director_name = director_elem.get_text(strip=True) if director_elem else None
    rating_elem = soup.select_one('div[data-testid="hero-rating-bar__aggregate-rating__score"] span')
    rating = rating_elem.get_text(strip=True) if rating_elem else None
    actors_list = [a.get_text(strip=True) for a in soup.select('a[data-testid="title-cast-item__actor"]')]

    movie_data = {
        'Rank': rank,
        'Movie Name': movie_name,
        'Year': movie_year,
        'Genre': ', '.join(genre),
        'Director': director_name,
        'Rating': rating,
        'Actors': ', '.join(actors_list),
        'Movie URL': full_movie_url
    }
    data.append(movie_data)

df = pd.DataFrame(data)
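The movie name is taken from the text that follows the period after the rank. A sketch with made-up cell text (not scraped data) shows why it matters how that split is limited when a title itself contains a period; the helper name `name_after_rank` is hypothetical.

```python
def name_after_rank(cell_text):
    """Return everything after the first period, which follows the rank."""
    # maxsplit=1 splits only once, so later periods stay in the title
    return cell_text.split('.', 1)[1].strip()

plain = "1.The Shawshank Redemption(1994)"
dotted = "70.Dr. Strangelove(1964)"  # made-up example of a title with a period

print(dotted.split('.')[1].strip())  # Dr
print(name_after_rank(dotted))       # Dr. Strangelove(1964)
print(name_after_rank(plain))        # The Shawshank Redemption(1994)
```

Both approaches give the same result for titles without a period, so the difference only shows up on a few rows.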


In [29]:
df.head()

| | Rank | Movie Name | Year | Genre | Director | Rating | Actors | Movie URL |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The Shawshank Redemption(1994) | 1994 | Drama | Frank Darabont | 9.3 | Tim Robbins, Morgan Freeman, Bob Gunton, Willi... | https://www.imdb.com/title/tt0111161/?pf_rd_m=... |
| 1 | 2 | The Godfather(1972) | 1972 | Crime, Drama | Francis Ford Coppola | 9.2 | Marlon Brando, Al Pacino, James Caan, Diane Ke... | https://www.imdb.com/title/tt0068646/?pf_rd_m=... |
| 2 | 3 | The Dark Knight(2008) | 2008 | Action, Crime, Drama | Christopher Nolan | 9.0 | Christian Bale, Heath Ledger, Aaron Eckhart, M... | https://www.imdb.com/title/tt0468569/?pf_rd_m=... |
| 3 | 4 | The Godfather Part II(1974) | 1974 | Crime, Drama | Francis Ford Coppola | 9.0 | Al Pacino, Robert De Niro, Robert Duvall, Dian... | https://www.imdb.com/title/tt0071562/?pf_rd_m=... |
| 4 | 5 | 12 Angry Men(1957) | 1957 | Crime, Drama | Sidney Lumet | 9.0 | Henry Fonda, Lee J. Cobb, Martin Balsam, John ... | https://www.imdb.com/title/tt0050083/?pf_rd_m=... |
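Because the `Genre` column stores several genres in one comma-separated string, counting genres means splitting each string first and then tallying the pieces. A minimal sketch of that step using only the standard library, applied to the five `Genre` values shown above:

```python
from collections import Counter

# The Genre values of the five rows shown above
genres_column = [
    "Drama",
    "Crime, Drama",
    "Action, Crime, Drama",
    "Crime, Drama",
    "Crime, Drama",
]

# Split each row on ", " and count every individual genre
counts = Counter(genre for row in genres_column for genre in row.split(", "))
print(counts.most_common())  # [('Drama', 5), ('Crime', 4), ('Action', 1)]
```

The resulting labels and counts are the kind of input a pie chart of genres needs.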


In [34]:

# Create a bar chart of movie ratings
rating_chart = go.Bar(
    x=df['Movie Name'],
    y=df['Rating'],
    marker=dict(color='rgb(63, 81, 181)'),
    name='Rating'
)

# Create a scatter plot of movie years vs. ratings
scatter_chart = go.Scatter(
    x=df['Year'],
    y=df['Rating'],
    mode='markers',
    marker=dict(
        size=10,
        color='rgb(76, 175, 80)',
        symbol='circle'
    ),
    name='Rating vs. Year'
)

# Create a histogram of movie release years
histogram_chart = go.Histogram(
    x=df['Year'],
    y=df['Rating'],
    nbinsx=10,  # Adjust the number of bins as needed
    marker=dict(color='rgb(76, 175, 80)'),
    name='Rating vs. Year'
)


# Create a pie chart of movie genres
genre_counts = df['Genre'].str.split(', ').explode().value_counts()
genre_chart = go.Pie(
    labels=genre_counts.index,
    values=genre_counts.values,
    hole=0.3,
    marker=dict(colors=['rgb(244, 67, 54)', 'rgb(33, 150, 243)', 'rgb(255, 193, 7)'])
)

# Define the Dash app layout
app = dash.Dash(__name__)

app.layout = html.Div([
    html.H1("Scraped Movie Details"),
    html.Div([
        dcc.Graph(
            id='rating-chart',
            figure={
                'data': [rating_chart],
                'layout': {
                    'title': 'Movie Ratings'
                }
            }
        )
    ]),
    html.Div([
        dcc.Graph(
            id='scatter-chart',
            figure={
                'data': [scatter_chart],
                'layout': {
                    'title': 'Movie Ratings vs. Year'
                }
            }
        )
    ]),
    html.Div([
        dcc.Graph(
            id='histogram-chart',
            figure={
                'data': [histogram_chart],
                'layout': {
                    'title': 'Rating vs. Year Histogram',
                    'xaxis': {'title': 'Year'},
                    'yaxis': {'title': 'Rating Count'}
                }
            }
        )
    ]),
    html.Div([
        dcc.Graph(
            id='genre-chart',
            figure={
                'data': [genre_chart],
                'layout': {
                    'title': 'Movie Genres'
                }
            }
        )
    ])
])

if __name__ == '__main__':
    app.run_server()


Dash is running on http://127.0.0.1:8050/

