# Web scraping with scrapy

In this post, we'll build a web scraper: a tool that extracts data from webpages. Suppose we want to find out which movies share the largest number of actors with our favorite movie, say, *The Shawshank Redemption*. A good place to find this information is [IMDB](https://imdb.com), which has

1. movie pages, each containing the movie's cast list, and
2. actor pages, each listing the actor's filmography.

How would we go about finding which movies share the largest number of actors with *Shawshank*? We would:
1. Start from IMDB's movie page for *The Shawshank Redemption*: https://www.imdb.com/title/tt0111161/
2. For each actor in its cast list, go to the actor's page and collect all the titles in their filmography.
3. See which other movies appear most frequently among the collected titles.
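
Step 3 is essentially a frequency count. As a minimal sketch of the idea, assuming the filmographies have already been collected into a dictionary (the actor and movie names here are invented for illustration, not scraped data), the standard library's `Counter` can do the tallying:

```python
from collections import Counter

# Hypothetical filmographies, invented for illustration (not scraped data).
filmographies = {
    "Actor A": ["Movie X", "Movie Y", "Movie Z"],
    "Actor B": ["Movie X", "Movie Y"],
    "Actor C": ["Movie X", "Movie W"],
}


def count_titles(filmographies):
    """Count how many actors' filmographies each title appears in."""
    counts = Counter()
    for titles in filmographies.values():
        # set() guards against a title listed twice for the same actor
        counts.update(set(titles))
    return counts


title_counts = count_titles(filmographies)
# The two titles shared by the most actors, most frequent first
print(title_counts.most_common(2))
```

Later in the post the same tally is done with `pandas` on the scraped CSV file.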

The `scrapy` spider described below automates steps 1 and 2. The setup is a bit different this time: instead of a notebook, the spider lives in a `.py` script that we run from the terminal.


In [73]:
import scrapy

In [None]:
class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    # starting movie: The Shawshank Redemption
    start_urls = ['https://www.imdb.com/title/tt0111161/']

    def parse(self, response):
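        """
        From a movie page, navigate to the Cast & Crew page,
        then call parse_full_credits(self, response) on it.
        """
        # urljoin resolves relative to the page URL, so a query string
        # on response.url does not end up in the middle of the path
        credit_url = response.urljoin("fullcredits")
        yield scrapy.Request(credit_url, callback=self.parse_full_credits)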

    def parse_full_credits(self, response):
        """
        From the Cast & Crew page, yield a scrapy.Request for each
        actor's page, to be handled by parse_actor_page(self, response).
        """
        actor_list = ["https://imdb.com" + a.attrib["href"]
                      for a in response.css("td.primary_photo a")]
        for actor_url in actor_list:
            yield scrapy.Request(actor_url, callback=self.parse_actor_page)

    def parse_actor_page(self, response):
        """
        For each movie/show on the actor page, yield a dictionary
        of the form {"actor": actor_name, "movie_or_TV_name":
        movie_or_TV_name}.
        """
        # select name of actor
        actor_name = response.css("div.article.name-overview span.itemprop::text").get()
        # all_films includes every credit, whether as actor or in other roles
        all_films = response.css("div.filmo-row")
        # keep only rows whose id starts with "actor-"; rows with any other
        # prefix (including a separate actress category, if the page uses
        # one) are skipped by this exact match
        films = [f.css('b a::text').get()
                 for f in all_films if f.attrib['id'].split('-')[0] == 'actor']
        for film in films:
            yield {"actor": actor_name, "movie_or_TV_name": film}

# To run the spider, call this from the terminal inside the Scrapy project directory:
# scrapy crawl imdb_spider -o results.csv
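
The credits URL in `parse` is derived from the movie page's URL, so it is worth seeing how that derivation behaves for different shapes of base URL. The sketch below uses the standard library's `urllib.parse.urljoin` (Scrapy is not needed to run it) to compare relative resolution with plain string concatenation. The `?ref_=example` query string and the `nm0000000` ID are made-up placeholders, not values taken from IMDB.

```python
from urllib.parse import urljoin

movie_url = "https://www.imdb.com/title/tt0111161/"

# Base ends in "/": the relative path is appended to it.
joined = urljoin(movie_url, "fullcredits")
print(joined)

# Hypothetical query string on the base: urljoin drops it,
# while plain concatenation glues the path onto the query.
base_with_query = movie_url + "?ref_=example"
with_query = urljoin(base_with_query, "fullcredits")
concatenated = base_with_query + "fullcredits"
print(with_query)
print(concatenated)

# Base without the trailing slash: the last path segment is replaced,
# so the movie ID is lost.
no_slash = urljoin("https://www.imdb.com/title/tt0111161", "fullcredits")
print(no_slash)

# A root-relative href (placeholder ID) resolves against the host only.
actor = urljoin(joined, "/name/nm0000000/")
print(actor)
```

The third case is the one to watch for: relative resolution only keeps the movie ID when the base URL ends in a slash, as the start URL in the spider does.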

In [69]:
import pandas as pd

In [70]:
results = pd.read_csv('results.csv')

In [71]:
ranking = results.groupby("movie_or_TV_name").size().reset_index(name='counts')
ranking = ranking.sort_values(by='counts',ascending=False).reset_index(drop=True)

In [72]:
ranking[0:10]

| | movie_or_TV_name | counts |
|---|---|---|
| 0 | The Shawshank Redemption | 65 |
| 1 | ER | 11 |
| 2 | Law & Order | 10 |
| 3 | CSI: Crime Scene Investigation | 10 |
| 4 | The West Wing | 9 |
| 5 | NYPD Blue | 9 |
| 6 | Cold Case | 9 |
| 7 | The Practice | 9 |
| 8 | The Twilight Zone | 8 |
| 9 | L.A. Law | 8 |
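
The top row is the starting movie itself, which is expected since every actor was reached through its cast list. Note also that `groupby(...).size()` counts rows, so an actor with two rows for the same title name would be counted twice. As a rough sketch of one way to handle both points, the function below skips the starting title and counts distinct actors per title. The sample rows and the `rank_shared_titles` helper are hypothetical; only the two column names come from the spider's output.

```python
import csv
import io

# Hypothetical rows in the same two-column shape as results.csv
sample_csv = """actor,movie_or_TV_name
Actor A,Start Movie
Actor A,Show X
Actor B,Start Movie
Actor B,Show X
Actor B,Show X
Actor C,Start Movie
Actor C,Show Y
"""


def rank_shared_titles(csv_file, start_title):
    """Rank titles by number of distinct actors, skipping start_title."""
    actors_by_title = {}
    for row in csv.DictReader(csv_file):
        title = row["movie_or_TV_name"]
        if title == start_title:
            continue  # the starting movie trivially tops the list
        # a set keeps each actor once per title
        actors_by_title.setdefault(title, set()).add(row["actor"])
    ranking = [(title, len(actors)) for title, actors in actors_by_title.items()]
    # most shared actors first; ties broken alphabetically by title
    ranking.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranking


ranking = rank_shared_titles(io.StringIO(sample_csv), "Start Movie")
print(ranking)
```

To run it on the real data, an open file handle for `results.csv` could be passed in place of the `io.StringIO` object.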
