Welcome to another Python tutorial! Today, we’re building a cool web scraper using `scrapy`.

Ever watched a movie and got so hooked on an actor’s performance that you just had to see more of their work? Why not let a web scraper do the searching for you? Let's make it happen!

## Setting up the Project

Let's get the project set up! First, make sure `scrapy` is installed in your conda environment.

Next, open a terminal and run `scrapy startproject TMDB_scraper`.

This will initialize a new Scrapy project called "TMDB_scraper".

### Creating a Spider

Let’s kick things off by creating a new file inside the `spiders` directory, and name it `tmdb_spider.py`.

We’re going to pull data from the <a href="https://www.themoviedb.org/?language=en-US">TMDB</a> website for our task.

Let's add the following lines to the file:

In [8]:
import scrapy

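class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'

    def __init__(self, subdir="", *args, **kwargs):
        # Let scrapy.Spider finish its own setup before we set start_urls
        super().__init__(*args, **kwargs)
        self.start_urls = [f"https://www.themoviedb.org/movie/{subdir}/"]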

We have named our spider `tmdb_spider`. Later, we will be able to run the completed spider for any movie by passing its subdirectory on the TMDB website as the `subdir` argument.

### Parsing the Movie Page

One of the first challenges in writing a web scraper is navigating a site's structure programmatically.

Let's break down our first parsing function, which handles the initial navigation from a movie's main page to its full credits page.

In [12]:
    def parse(self, response):
        """Parse movie page and navigate to full credits page.
            Args: Response object containing the movie page HTML
            Yields: Request object for the full credits page
        """
        cast_link = response.css('.new_button a::attr(href)').get()
        yield response.follow(cast_link, callback=self.parse_full_credits)
        # or we can do this with hard coding:
        # yield response.follow("cast", callback=self.parse_full_credits)

We can use a CSS selector to locate the "Full Cast & Crew" link by targeting an anchor tag within an element with the class `new_button` and extracting its `href` attribute.

Instead of constructing the URL manually, we can use Scrapy's built-in `response.follow()` method. It automatically handles relative URLs.

Since the TMDB website consistently uses "/cast" for every movie's cast and crew page, we can also hardcode the path instead of using a CSS selector.

We specify `parse_full_credits` as the callback function, ensuring that once we reach the cast page, that method handles the next stage of the scraping process.
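To make "handles relative URLs" concrete, here is a small sketch using the standard library's `urllib.parse.urljoin`, which, as far as I know, applies the same resolution rules Scrapy uses when it joins a link against the page URL. The base URLs and the root-relative `href` value are assumptions for illustration; the real `response.url` may differ, for example after a redirect.

```python
from urllib.parse import urljoin

# Assumed page URLs, with and without a trailing slash.
base_with_slash = "https://www.themoviedb.org/movie/10772-django/"
base_without_slash = "https://www.themoviedb.org/movie/10772-django"

# A relative path is resolved against the "directory" part of the base URL.
hardcoded_ok = urljoin(base_with_slash, "cast")
hardcoded_risky = urljoin(base_without_slash, "cast")

# A root-relative href (hypothetical value of the scraped link) ignores the
# base path entirely, so the trailing slash no longer matters.
from_href = urljoin(base_without_slash, "/movie/10772-django/cast")

print(hardcoded_ok)
print(hardcoded_risky)
print(from_href)
```

In the second case the movie's subdirectory is dropped, which is one reason the selector-based link can be the more robust of the two options if the page URL ever lacks a trailing slash.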

### Parsing the Full Cast

Moving on to our second function, we'll try to extract links to individual actor pages from the full credits page.

In [16]:
    def parse_full_credits(self, response):
        """Parse the full credits page and navigate to each actor's page.
            Args: Response object containing the full credits page HTML
            Yields: Request objects for each actor's page
        """
        actor_links = response.css('ol.people.credits:not(.crew) li div.info p a::attr(href)').getall()
        for link in actor_links:
            yield response.follow(link, callback=self.parse_actor_page)

Here, `ol.people.credits` targets an ordered list with classes `people` and `credits`.

We use the CSS pseudo-class `:not(.crew)` to exclude any list that also has the class `crew`, so we focus only on actors!

Then, `li div.info p a` navigates through the HTML to find actor links, extracting the URL from each link just as in the previous function.

Finally, we set `parse_actor_page` as the callback for processing each actor's information.

### Parsing the Actor Page

The purpose of our scraping project is extracting structured data. Let's see how our `parse_actor_page` function pulls out an actor's acting credits:

In [20]:
    def parse_actor_page(self, response):
        """Parse an actor's page and extract their acting credits.
            Args: Response object containing the actor's page HTML
            Yields: {"actor": str, "movie_or_TV_name": str}
        """
        actor_name = response.css('h2.title a::text').get()
        
        # Find all section headers and their corresponding tables
        sections = response.css('div.credits_list h3::text').getall()
        tables = response.css('div.credits_list table.credits')
        
        # Find Acting section and process its table
        for i, section in enumerate(sections):
            if section.strip() == "Acting":
                all_rows = tables[i].css('table.credit_group tr')
                unique_titles = {row.css('a.tooltip bdi::text').get().strip() 
                            for row in all_rows 
                            if row.css('a.tooltip bdi::text').get()}
                
                for title in unique_titles:
                    yield {
                        "actor": actor_name,
                        "movie_or_TV_name": title
                    }
                break

Say we're scraping Clint Eastwood's page. Here's exactly what happens:

* First, we grab his name from the page title using the selector `h2.title a::text`.
* Next, we look for sections on the page. The `div.credits_list h3::text` selector finds headers like "Acting", "Directing", "Writing", etc.
* Meanwhile, `div.credits_list table.credits` grabs the corresponding tables of work under each header.
* Upon finding the "Acting" section, we grab its table, which lists rows of all movies and TV shows Eastwood has acted in.
* For each row, we extract the title using `a.tooltip bdi::text`, yielding titles like "A Fistful of Dollars" and "The Good, the Bad and the Ugly".
* These titles are stored in a set to eliminate duplicates.
* Finally, for each unique title, we yield a dictionary in this format: `{"actor": "Clint Eastwood", "movie_or_TV_name": "A Fistful of Dollars"}`.
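The steps above can be sketched without Scrapy at all. In this minimal version, plain Python lists stand in for the selector results: `sections` plays the role of the header strings and `tables` the role of the per-section title lists. The function name `acting_credits` and the sample data are hypothetical, chosen only to mirror the logic of `parse_actor_page`.

```python
def acting_credits(actor_name, sections, tables):
    """Yield one dict per unique title in the "Acting" section.

    `sections` is a list of header strings and `tables` a parallel list of
    title lists; both stand in for the selector results in the spider.
    """
    for i, section in enumerate(sections):
        if section.strip() == "Acting":
            # Skip empty cells, strip whitespace, and de-duplicate with a set.
            unique_titles = {title.strip() for title in tables[i] if title}
            # Sorted only so that this demo's output order is predictable.
            for title in sorted(unique_titles):
                yield {"actor": actor_name, "movie_or_TV_name": title}
            break


# Hypothetical stand-in data; the real values come from the page's HTML.
sections = ["Acting", "Directing"]
tables = [
    ["A Fistful of Dollars", " A Fistful of Dollars ", None,
     "The Good, the Bad and the Ugly"],
    ["(placeholder directing credit)"],
]

items = list(acting_credits("Clint Eastwood", sections, tables))
print(items)
```

The duplicate title and the empty cell both disappear, and nothing from the "Directing" list is emitted.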

## Running the Scraper

Now that we've finished writing the scraper, we can run it for any film we want!

For example, we can scrape films related to the cult Western classic: Sergio Corbucci's <a href="https://www.themoviedb.org/movie/10772-django">*Django* (1966)</a>.

In the terminal, from inside the project directory, we can run `scrapy crawl tmdb_spider -o movies.csv -a subdir=10772-django`.

This will create a `.csv` file named "movies" with one column for the actors in *Django* and one column for their movies or TV shows.
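As a rough picture of the resulting file, the sketch below writes one item of the shape the spider yields to an in-memory CSV using the standard `csv` module. This is not Scrapy's exporter, only an illustration of the expected two-column layout, and the item is the Clint Eastwood example from earlier rather than real crawl output.

```python
import csv
import io

# One item in the shape the spider yields (example from the text above).
items = [
    {"actor": "Clint Eastwood", "movie_or_TV_name": "A Fistful of Dollars"},
]

# Write the items as CSV into an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["actor", "movie_or_TV_name"])
writer.writeheader()
writer.writerows(items)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back gives one dict per row, keyed by the header names.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows)
```

Each row pairs one actor with one title, so an actor with many credits appears on many rows.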

## Visualizing the Result

Now that we've scraped the data, we can create quick visualizations of our results!

Our goal is to build a mini movie recommender based on the collected data. For example, we can compute a sorted list of top movies and TV shows that share actors with *Django*. This approach might provide a simple recommendation system for Spaghetti Western films similar to *Django*!

Let's first set up the environment:

In [28]:
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default="iframe"

Then we can use `pandas` to read the CSV file:

In [30]:
df = pd.read_csv('TMDB_scraper/movies.csv')

Then, we can collect the names of all actors who appeared in *Django* into a set:

In [32]:
django_actors = set(df[df['movie_or_TV_name'] == 'Django']['actor'])

We can initialize an empty dictionary that will store each movie/show and its count of shared actors with *Django*.

We then iterate through each unique movie/show (excluding *Django*), count how many of its actors also appear in *Django*, and store that count whenever it is nonzero.

In [34]:
movie_counts = {}
for movie in df[df['movie_or_TV_name'] != 'Django']['movie_or_TV_name'].unique():
    shared_actors = set(df[df['movie_or_TV_name'] == movie]['actor']) & django_actors
    if len(shared_actors) > 0:  # Only include movies with shared actors
        movie_counts[movie] = len(shared_actors)
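For comparison, the same counting idea can be sketched with only the standard library. The rows below are hypothetical placeholders ("Actor A", "Film X", and so on), not scraped data, and `collections.Counter` is swapped in for the explicit loop; this is an alternative illustration, not a replacement for the pandas code above.

```python
from collections import Counter

# Hypothetical (actor, title) rows standing in for the scraped CSV.
rows = [
    ("Actor A", "Django"),
    ("Actor B", "Django"),
    ("Actor A", "Film X"),
    ("Actor B", "Film X"),
    ("Actor A", "Film Y"),
    ("Actor C", "Film Y"),  # Actor C is not in Django, so this row is ignored
]

django_actors = {actor for actor, title in rows if title == "Django"}

# De-duplicate (actor, title) pairs, then count Django actors per other title.
pairs = {
    (actor, title)
    for actor, title in rows
    if title != "Django" and actor in django_actors
}
movie_counts = Counter(title for _, title in pairs)

# most_common() returns (title, count) tuples, highest count first.
print(movie_counts.most_common())
```

Titles with no shared actors never enter the counter, which matches the `len(shared_actors) > 0` check in the loop.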

We then transform our counts into a structured DataFrame and sort it by the number of shared actors in descending order:

In [36]:
df = pd.DataFrame([
    {'movie': movie, 'shared_actors': count} 
    for movie, count in movie_counts.items()
]).sort_values('shared_actors', ascending=False)

In [37]:
df

| | movie | shared_actors |
|---|---|---|
| 139 | Compañeros | 7 |
| 177 | The Mercenary | 6 |
| 42 | Texas, Adios | 5 |
| 135 | The Hellbenders | 4 |
| 108 | Navajo Joe | 4 |
| ... | ... | ... |
| 395 | Un brivido sulla pelle | 1 |
| 396 | The Handsome, The Ugly, And The Stupid | 1 |
| 397 | Desert Commandos | 1 |
| 398 | L'Ottimista Sorridente | 1 |

Finally, we can create a horizontal bar chart of the 15 movies/shows that share the most actors with *Django*:

In [39]:
fig = px.bar(df.head(15),  # Get top 15
            x='shared_actors', 
            y='movie',
            orientation='h',
            title='Top 15 Movies/TV Shows Sharing Actors with Django',
            labels={'shared_actors': 'Number of Shared Actors',
                   'movie': 'Movie/TV Show'})

In [40]:
fig.show()

Nice! We've discovered other productions featuring actors from *Django*.

Now it's time for me to binge-watch some B-class Italowesterns...