# Session 2 - Writing Web Crawlers


In [3]:
import requests
from bs4 import BeautifulSoup as bs
import re


## A Simple One-Page Crawler

Let's build a web crawler that fetches the all-time top 100 movies of a given genre:

In [4]:
# first web crawler
def top_100_movies(genre):
    headers = {
        'user-agent': 
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        'Cache-Control':'no-cache'
    }
    
    link = 'https://www.rottentomatoes.com/top/bestofrt/top_100_' + genre.lower() + '_movies/'
    r = requests.get(link, headers=headers, timeout=20)
    r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
    soup = bs(r.text, "lxml")
    # each movie title is an <a> element inside the ranking table
    movies = soup.find('table', class_="table").find_all('a', class_="unstyled articleLink")
    return [movie.getText().strip() for movie in movies]

In [5]:
genres = ['animation', 'horror', 'drama', 'comedy',
          'classics', 'documentary', 'romance',
          'mystery__suspense', 'action__adventure',
          'science_fiction__fantasy', 'art_house__international']


# 3 top movies in each category
for genre in genres:
    print(f"{genre.title().replace('__', ' & ')}: {', '.join(top_100_movies(genre)[:3])}.")

Animation: Toy Story 4 (2019), Spider-Man: Into the Spider-Verse (2018), Inside Out (2015).
Horror: Us (2019), Get Out (2017), The Cabinet of Dr. Caligari (Das Cabinet des Dr. Caligari) (1920).
Drama: Black Panther (2018), Citizen Kane (1941), Parasite (Gisaengchung) (2019).
Comedy: It Happened One Night (1934), Modern Times (1936), Toy Story 4 (2019).
Classics: It Happened One Night (1934), Modern Times (1936), Citizen Kane (1941).
Documentary: Won't You Be My Neighbor? (2018), I Am Not Your Negro (2017), Apollo 11 (2019).
Romance: It Happened One Night (1934), Casablanca (1942), The Philadelphia Story (1940).
Mystery & Suspense: Citizen Kane (1941), Knives Out (2019), Us (2019).
Action & Adventure: Black Panther (2018), Avengers: Endgame (2019), Mission: Impossible - Fallout (2018).
Science_Fiction & Fantasy: Black Panther (2018), The Wizard of Oz (1939), Avengers: Endgame (2019).
Art_House & International: Parasite (Gisaengchung) (2019), The Cabinet of Dr. Caligari (Das Cabinet des 

## Traversing a Single Domain


In [6]:
r = requests.get('http://en.wikipedia.org/wiki/Kevin_Bacon')
soup = bs(r.text, 'html.parser')

In [7]:
# We need the links that point to the same domain we are on.
# Let's first collect every link in the page.

# Links are <a> tags; keep only those that carry an href attribute:
lilst = []
for link in soup.find_all('a'):
    if 'href' in link.attrs:
        lilst.append(link.attrs['href'])
print(len(lilst))

887


There are almost 900 links on this single Wikipedia page.

In [8]:
lilst[5:15]

['/wiki/Philadelphia,_Pennsylvania',
 '/wiki/Kyra_Sedgwick',
 '/wiki/Sosie_Bacon',
 '#cite_note-1',
 '/wiki/Edmund_Bacon_(architect)',
 '/wiki/Michael_Bacon_(musician)',
 '/wiki/Holly_Near',
 'http://baconbros.com/',
 '#cite_note-2',
 '#cite_note-actor-3']
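
The sample above mixes three kinds of `href` values: site-relative paths, in-page fragments, and absolute external URLs. As a rough sketch of how they could be told apart with the standard library alone, the hypothetical helper `is_internal` below resolves each value against the page URL and compares host names. The helper name and the `BASE` constant are assumptions for illustration, not part of the notebook.

```python
from urllib.parse import urljoin, urlparse

# Assumed base URL: the page the hrefs were collected from.
BASE = 'https://en.wikipedia.org/wiki/Kevin_Bacon'

def is_internal(href, base=BASE):
    """Return True if href points to another page on the same host as base."""
    if href.startswith('#'):
        # a fragment only jumps within the current page
        return False
    absolute = urljoin(base, href)  # resolve relative paths against the page URL
    return urlparse(absolute).netloc == urlparse(base).netloc

sample = ['/wiki/Kyra_Sedgwick', '#cite_note-1', 'http://baconbros.com/']
internal = [h for h in sample if is_internal(h)]
print(internal)
```

A host comparison like this still lets through header, footer, and sidebar links, which is why a stricter filter is applied next.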

We only want article links from the body content, that is, links that relate to Kevin Bacon. We don't care about the footer, header, or sidebars.
We can select them with a regular expression such as `^(/wiki/)((?!:).)*$`, which matches paths that start with `/wiki/` and contain no colon:

In [9]:
lilst = []
for link in soup.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')): 
    if 'href' in link.attrs:
        lilst.append(link.attrs['href'])
print(len(lilst))

406
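
To see what the pattern keeps and what it drops, here is a small offline check that needs no network access. The `candidates` list is made up for illustration: three values come from the earlier output, and `/wiki/Category:American_actors` is a hypothetical example of a namespace page.

```python
import re

# Same pattern as in the cell above.
WIKI_LINK = re.compile(r'^(/wiki/)((?!:).)*$')

candidates = [
    '/wiki/Kyra_Sedgwick',             # article link: kept
    '/wiki/Category:American_actors',  # contains a colon: dropped (hypothetical example)
    '#cite_note-1',                    # in-page anchor: dropped
    'http://baconbros.com/',           # external site: dropped
]
kept = [href for href in candidates if WIKI_LINK.match(href)]
print(kept)
```

The negative lookahead `(?!:)` is checked before every character after `/wiki/`, so a single colon anywhere in the path is enough to reject the link.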


In [11]:
lilst[10:20]

['/wiki/Apollo_13_(film)',
 '/wiki/Mystic_River_(film)',
 '/wiki/Balto_(film)',
 '/wiki/Sleepers',
 '/wiki/The_Woodsman_(2004_film)',
 '/wiki/Animal_House',
 '/wiki/Diner_(1982_film)',
 '/wiki/Tremors_(1990_film)',
 '/wiki/Crazy,_Stupid,_Love',
 '/wiki/Friday_the_13th_(1980_film)']

That looks good.
Let's wrap this logic in a function:

In [16]:
import datetime
import random

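# random.seed() only accepts None, int, float, str, bytes or bytearray on Python 3.11+,
# so seed with the numeric timestamp rather than the datetime object itself
random.seed(datetime.datetime.now().timestamp())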

def getLinks(articleUrl):
    html = requests.get('http://en.wikipedia.org{}'.format(articleUrl)) 
    soup = bs(html.text, 'html.parser')
    return soup.find('div', {'id':'bodyContent'}).find_all('a',href=re.compile('^(/wiki/)((?!:).)*$'))

In [17]:
links = list(getLinks('/wiki/Kevin_Bacon'))

In [23]:
links = links[2:10]

In [None]:
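# Random walk: keep following a randomly chosen article link until a page has none.
# This can run for a long time; interrupt the kernel to stop it.
while len(links) > 0:
    newArticle = random.choice(links).attrs['href']
    print(newArticle)
    links = getLinks(newArticle)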

To avoid crawling the same page twice or getting stuck in loops, format every discovered internal link consistently and keep it in a running set while the program runs, so that checking whether a page has already been visited is a fast lookup.
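
One way this could look is sketched below. It is not the notebook's own crawler: `normalize`, `crawl`, `max_pages`, and the toy `fake_site` graph are all hypothetical names introduced for illustration, and the link source is passed in as a function so the example runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag

def normalize(path):
    """Drop the #fragment so links to sections of one page compare equal."""
    url, _fragment = urldefrag(path)
    return url

def crawl(start, get_links, max_pages=50):
    """Breadth-first crawl that visits each normalized path at most once."""
    visited = set()
    queue = deque([normalize(start)])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue  # already crawled via another route
        visited.add(page)
        for href in get_links(page):
            link = normalize(href)
            if link not in visited:
                queue.append(link)
    return visited

# Toy link graph standing in for real pages (hypothetical data).
fake_site = {
    '/wiki/A': ['/wiki/B', '/wiki/A#History'],
    '/wiki/B': ['/wiki/A', '/wiki/C'],
    '/wiki/C': [],
}
pages = crawl('/wiki/A', lambda page: fake_site.get(page, []))
print(sorted(pages))
```

With the notebook's `getLinks`, the callable could be something like `lambda p: [a.attrs['href'] for a in getLinks(p)]`. The `max_pages` cap is there so that a crawl over a large site still terminates.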

In the next session we are going to see how to automatically extract data from different web sources. Stay tuned!