# Wikipedia Crawl

Using the OMDb API we can pull data on a film, provided we supply the film's title. Thus, to pull ample film data from IMDb, we will need an ample list of film names. After a bit of Googling, I found a Wikipedia page listing movies, but it is split across several subpages. We'll use the BeautifulSoup package to crawl these pages and generate our list of movies. The web crawling approach we take here is loosely based on the tutorial https://goo.gl/Bm1cdD.

There are two paths of attack:
1. `https://en.wikipedia.org/wiki/Lists_of_films` aims to catalog every movie ever made, classified under a few different hierarchies.
2. `https://en.wikipedia.org/wiki/Lists_of_American_films` covers just American films, organized by year. This is easier to work with, since the organization is more orderly and the year pages generally have similar layouts.

As a starting point I'll just do the easier thing and use the American movie list. In the future if we want more training data then we can revisit getting foreign films. 

There are a few inconsistencies between the different years' pages, which we will have to consider:
1. Movies before 1900 have a single page and are held in a single table with multi-column spans grouping the different years.
2. Starting around 2014, there are multi-row spans for the film's opening month: a tacky box spells out the month name vertically over multiple rows. Worse still, the day of the month is then in a separate column with multi-row spans, and these two columns are grouped together under a single "Opening" header.
3. Other pages have tables split up by first letter of the name of the film, and don't necessarily have the same column names from table to table, even within a single year's page.
4. Occasionally there is a typo in the table. An empty cell may not have been coded in, or an extra cell may be unintentionally present in one row. As Wikipedia is crowd-edited, these errors may randomly pop up and go away as edits are made.


## 1. Generalities on web crawling tables

A webpage is basically a tree with nodes labeled by tags. `BeautifulSoup` packages this tree in an object that is easy to navigate and search, based off of the html tags attached to the nodes. Here's how we are going to pull the movie data for any given year.
1. We pull the raw html using a `GET` request. It is generally frowned upon to make many requests to a website in a short period of time. For us this means that a request may be denied if too many are made too quickly. So, after each page we should pause for a moment.
2. Dump the html into a `BeautifulSoup` object.
3. From inspecting the raw html it looks like the data we want is always in a table, which we can find with the `<table>` tag. There are some other tables, such as the footers and page navigation, that we don't want. Again from inspection, it looks like the tables we do want always have a `class="wikitable"` attribute.
4. The tables are then split into table rows, as indicated by a `<tr>` tag. A row consists of headers (`<th>` table header tag), data (`<td>` table data tag), or a combination of the two.
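
Steps 2 to 4 can be sketched offline on a made-up miniature page. This is only an illustration: `SAMPLE_HTML` and the helper `classify_row` are hypothetical names, and the network request from step 1 is left out so the snippet runs without touching Wikipedia.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one data table plus a navigation table to skip.
SAMPLE_HTML = """
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable">
  <tr><th>Title</th><th>Director</th></tr>
  <tr><td>Taken 3</td><td>Olivier Megaton</td></tr>
  <tr><th>9</th><td>Blackhat</td></tr>
</table>
"""


def classify_row(row):
    """Label a <tr> as 'header', 'data', 'mixed', or 'empty' from its own cells."""
    tags = {cell.name for cell in row.find_all(['th', 'td'], recursive=False)}
    if tags == {'th'}:
        return 'header'
    if tags == {'td'}:
        return 'data'
    if tags == {'th', 'td'}:
        return 'mixed'
    return 'empty'


# Step 2: dump the html into a BeautifulSoup object.
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
# Step 3: keep only tables carrying the "wikitable" class.
data_tables = soup.find_all('table', class_='wikitable')
# Step 4: split each table into rows and look at what each row holds.
row_kinds = [classify_row(row)
             for table in data_tables
             for row in table.find_all('tr')]
print(len(data_tables), row_kinds)
```

The "mixed" case is the one that matters later: in the newer pages the month and day cells are `<th>` tags sitting inside otherwise ordinary data rows.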

In [1]:
import os

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

from dotenv import load_dotenv, find_dotenv
#find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()
# load up the entries as environment variables
load_dotenv(dotenv_path)

## 2. Pulling from tables

As each Wikipedia film page generally has the data separated into several tables, the first step is to make a function that can process a single table. For prototyping we take a sample of a (portion of) one of the tables we are interested in parsing.

(Aside: For some reason I got it into my head that it is bad practice to use triple-quoted strings (the docstring syntax) for long, multiline strings, because they are meant for documentation purposes. I'm only using one here because, well, prototyping. Also, I can't justify why using them as ordinary strings is actually bad practice.)

In [2]:
html = """
<table>
<tr><th colspan="2">Opening</th>
  <th style="width:20%;">Title</th>
  <th style="width:10%;">Director</th>
  <th>Cast</th>
  <th style="width:13%">Genre</th>
  <th style="width:20%">Notes</th>
  <th>Ref.</th>
</tr>

<tr>
  <th rowspan="22"><b>J<br />A<br />N<br />U<br />A<br />R<br />Y<br /></b></th>
  <th><b>2</b></th>
  <td><i><a href="/wiki/The_Woman_in_Black_2:_Angel_of_Death" class="mw-redirect" title="The Woman in Black 2: Angel of Death">The Woman in Black 2: Angel of Death</a></i></td>
  <td><a href="/wiki/Tom_Harper_(director)" title="Tom Harper (director)">Tom Harper</a></td>
  <td><a href="/wiki/Phoebe_Fox" title="Phoebe Fox">Phoebe Fox</a><p><a href="/wiki/Jeremy_Irvine" title="Jeremy Irvine">Jeremy Irvine</a></p><p><a href="/wiki/Helen_McCrory" title="Helen McCrory">Helen McCrory</a></p><p><a href="/wiki/Adrian_Rawlins" title="Adrian Rawlins">Adrian Rawlins</a></p><p><a href="/wiki/Ned_Dennehy" title="Ned Dennehy">Ned Dennehy</a></p></td>
  <td><a href="/wiki/Horror_film" title="Horror film">Horror</a></td>
  <td><a href="/wiki/Relativity_Media" title="Relativity Media">Relativity Media</a><p>Sequel to <i><a href="/wiki/The_Woman_in_Black_(2012_film)" title="The Woman in Black (2012 film)">The Woman in Black (2012)</a></i></p></td>
  <td></td>
</tr>

<tr>
  <th rowspan="2"><b>9</b></th> 
  <td><i><a href="/wiki/Taken_3" title="Taken 3">Taken 3</a></i></td>
  <td><a href="/wiki/Olivier_Megaton" title="Olivier Megaton">Olivier Megaton</a></td>
  <td><a href="/wiki/Liam_Neeson" title="Liam Neeson">Liam Neeson</a><p><a href="/wiki/Forest_Whitaker" title="Forest Whitaker">Forest Whitaker</a></p><p><a href="/wiki/Famke_Janssen" title="Famke Janssen">Famke Janssen</a></p><p><a href="/wiki/Maggie_Grace" title="Maggie Grace">Maggie Grace</a></p></td>
  <td><a href="/wiki/Action_film" title="Action film">Action</a></td>
  <td><a href="/wiki/20th_Century_Fox" title="20th Century Fox">20th Century Fox</a></td>
  <td></td>
</tr>
</table>
"""

In [3]:
soup = BeautifulSoup(html, 'html.parser')
# Get a table to work with
tables = soup.find_all('table') #, {'class': 'wikitable'})
table = tables[0]

# Get all rows from table, i.e., all <tr> tags
rows = table.find_all('tr')

# row of headers
row0 = rows[0]
# row with awkward month column
row1 = rows[1]
# row with awkward day column
row2 = rows[2]

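**Fun fact.** In Jupyter, when operations are run multiple times on BeautifulSoup Tag objects, attributes can get stripped and cause errors on repeated runs. The likely culprit is reading attributes with `attrs.pop`, which removes the key from the tag's attribute dictionary in place; `attrs.get` reads the value without modifying the tag.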

### 2.1. Processing table headers

A naive approach to finding the headers in a table would be to find all `<th>` tags, and then collect them. However, some headers have a `colspan` attribute which makes the header span multiple cells of a row.
Our code will need a way of dealing with these multicolumn headers. The approach we will take is to treat each cell spanned by a multicolumn header as its own column, but add a suffix to indicate that the columns are connected. In the example html, the `Opening` header has a colspan of 2, so we will make two columns `Opening__0` and `Opening__1`. (The `__` is chosen because it is unlikely to occur in the wild.)

In [4]:
# Check if row is a header row and if so, build the columns.
# Should be refactored to be more pythonic?
# I Want list comprehensions...
def parse_header_row(row):
    # If exists <td> tags then not a header row
    if row.find_all('td'):
        raise ValueError("`row` is not a table header.")
    
    columns = []
    for x in row.find_all('th'):
        # Read with .get rather than .pop so the tag itself is left unchanged
        colspan = int(x.attrs.get('colspan', 1))
        if colspan > 1:
            columns += [f'{x.text.strip()}__{i}' for i in range(colspan)]
        else:
            columns += [x.text.strip()]
    return columns

columns = parse_header_row(row0)
columns

['Opening__0',
 'Opening__1',
 'Title',
 'Director',
 'Cast',
 'Genre',
 'Notes',
 'Ref.']

### 2.2. Processing rows

Again, simply finding all the `<td>` tags in a row for the data is too naive. The first issue is that there are multirows, where a single value is spread over cells in multiple rows of the table. The second issue is that some tables have typos and a cell in a row is not coded. To deal with these issues, we will take the columns of the table as an immutable specification of how many cells there should be per row.

To deal with multirows, we will need a `counter` which propagates values from earlier in the table for as many rows as specified by a multirow attribute when one is encountered. If a row is having values propagated to it from earlier in the table, it will have fewer `<td>` tags than expected. If a row has multiple cells with values being propagated, then the mapping between cell number and data tags is not direct (though it is monotonic). We deal with this using a `cell_cursor` that advances through the row's cells only when a value is not propagated.

Most missing or extra data cells only occur at the end of a row, as anywhere else would cause the cells of the row to be misaligned with the headers and be more easily seen by editors of the page. So, we can also deal with the missing cell issue by filling in the rest of the cells of a row with `None` after the `cell_cursor` has reached the end of the table row.

In [5]:
# Again, there has to be a more elegant way to do this...
def parse_data_row(row, columns, counters):
    cells = row.find_all(['th', 'td'])
    cell_cursor = 0
    row_processed = []

    for col in columns:

        # Check if values to propagate
        if counters[col][0] > 0:
            cell_value = counters[col][1]
            counters[col][0] -= 1   
        # If not propagate, get from cell    
        elif cell_cursor < len(cells):
            cell = cells[cell_cursor]
            # Read with .get rather than .pop so the tag itself is left unchanged
            rowspan = int(cell.attrs.get('rowspan', 1))
            cell_value = cell.text.strip()

            if rowspan > 1:
                counters[col] = [rowspan - 1, cell_value]

            cell_cursor += 1
        else:
            cell_value = None

        row_processed.append(cell_value)     
        
    return row_processed

Let's check that the counters are behaving correctly. Notice the values of the `Opening__0` and `Opening__1` cells where multirows are present.

In [6]:
# For entries with rowspan, save value and amount left to fill
counters = dict((key, [0, None]) for key in columns)

print('initial counters:', counters, '', sep='\n')
parsed1 = parse_data_row(row1, columns, counters)

print('parsed row:', parsed1, '', sep='\n')
print('counters:', counters, '', sep='\n')

initial counters:
{'Opening__0': [0, None], 'Opening__1': [0, None], 'Title': [0, None], 'Director': [0, None], 'Cast': [0, None], 'Genre': [0, None], 'Notes': [0, None], 'Ref.': [0, None]}

parsed row:
['JANUARY', '2', 'The Woman in Black 2: Angel of Death', 'Tom Harper', 'Phoebe FoxJeremy IrvineHelen McCroryAdrian RawlinsNed Dennehy', 'Horror', 'Relativity MediaSequel to The Woman in Black (2012)', '']

counters:
{'Opening__0': [21, 'JANUARY'], 'Opening__1': [0, None], 'Title': [0, None], 'Director': [0, None], 'Cast': [0, None], 'Genre': [0, None], 'Notes': [0, None], 'Ref.': [0, None]}



In [7]:
parsed2 = parse_data_row(row2, columns, counters)
print('parsed row:', parsed2, '', sep='\n')
print('counters:', counters, '', sep='\n')

parsed row:
['JANUARY', '9', 'Taken 3', 'Olivier Megaton', 'Liam NeesonForest WhitakerFamke JanssenMaggie Grace', 'Action', '20th Century Fox', '']

counters:
{'Opening__0': [20, 'JANUARY'], 'Opening__1': [1, '9'], 'Title': [0, None], 'Director': [0, None], 'Cast': [0, None], 'Genre': [0, None], 'Notes': [0, None], 'Ref.': [0, None]}
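
The two rows above exercise the rowspan counters but not the missing-cell case. As a minimal sketch, here is the same bookkeeping on plain `(text, rowspan)` tuples instead of BeautifulSoup tags. The function `fill_row`, the shortened column list, and the sample rows are hypothetical stand-ins, not part of the actual script.

```python
def fill_row(cells, columns, counters):
    """Align one row's cells to `columns`.

    `cells` is a list of (text, rowspan) tuples standing in for the row's
    <th>/<td> tags. `counters` maps column -> [rows_left, value] and is
    updated in place, mirroring `parse_data_row`.
    """
    cursor = 0
    out = []
    for col in columns:
        if counters[col][0] > 0:
            # A rowspan from an earlier row still covers this column.
            value = counters[col][1]
            counters[col][0] -= 1
        elif cursor < len(cells):
            # Consume the next real cell and remember any rowspan it starts.
            value, rowspan = cells[cursor]
            if rowspan > 1:
                counters[col] = [rowspan - 1, value]
            cursor += 1
        else:
            # The row ran out of cells: pad with None.
            value = None
        out.append(value)
    return out


columns = ['Month', 'Day', 'Title', 'Ref.']
counters = {col: [0, None] for col in columns}
rows = [
    [('JANUARY', 3), ('9', 2), ('Taken 3', 1), ('', 1)],
    [("Let's Kill Ward's Wife", 1)],  # month and day propagated, Ref. cell missing
    [('14', 1), ('Match', 1), ('[12]', 1)],
]
filled = [fill_row(cells, columns, counters) for cells in rows]
for row in filled:
    print(row)
```

The second row supplies a single cell, yet still comes out four columns wide: the first two values are propagated and the last is padded with `None`.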



Note that the `text` attribute in BeautifulSoup drops tags like `<br />` and `<p>`, which should possibly be interpreted as `\n` characters, or some other divider. For now we will just ignore this. The only place this really matters is the "Cast" column, where it clumps actor names together.

### 2.3. Parsing whole table

Parsing the entire table is just a matter of parsing the headers, setting up the counters, and iterating over all rows.

In [8]:
def parse_table(table):
    """Parse rows of table."""
    rows = table.find_all('tr')

    header_row = rows.pop(0)
    columns = parse_header_row(header_row) 
    # For entries with rowspan, save value and amount left to fill        
    counters = dict((key, [0, None]) for key in columns)
    
    table_parsed = []
    for row in rows:
        row_parsed = parse_data_row(row, columns, counters)
        table_parsed.append(row_parsed)
        
    table_parsed = pd.DataFrame(table_parsed, columns=columns)
    return table_parsed


In [9]:
url = 'https://en.wikipedia.org/wiki/List_of_American_films_of_2015'
# Get raw html
response = requests.get(url)  
# Soupify
soup = BeautifulSoup(response.text, 'html.parser')

# Get a table to work with
tables = soup.find_all('table', {'class': 'wikitable'})
table = tables[0]

parse_table(table).head()

| | Opening__0 | Opening__1 | Title | Director | Cast | Genre | Notes | Ref. |
|---|---|---|---|---|---|---|---|---|
| 0 | JANUARY | 2 | The Woman in Black 2: Angel of Death | Tom Harper | Phoebe Fox\nJeremy Irvine\nHelen McCrory\nAdri... | Horror | Relativity Media\nSequel to The Woman in Black... | |
| 1 | JANUARY | 9 | Taken 3 | Olivier Megaton | Liam Neeson\nForest Whitaker\nFamke Janssen\nM... | Action | 20th Century Fox | |
| 2 | JANUARY | 9 | Let's Kill Ward's Wife | Scott Foley | Scott FoleyPatrick WilsonDonald FaisonJames Ca... | Comedy | Well Go USA Entertainment | [11] |
| 3 | JANUARY | 14 | Match | Stephen Belber | Patrick StewartCarla GuginoMatthew LillardRob ... | Drama | IFC Films | [12] |
| 4 | JANUARY | 16 | Blackhat | Michael Mann | Chris HemsworthViola DavisManny MontanaTang Wei | Action | Universal Pictures | [13] |
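
To repeat this for several years while pausing between requests, something like the following could work. It is a sketch under assumptions: `crawl_years` is a hypothetical helper, the URL pattern is assumed to match the 2015 page for the other years, and the download function is passed in as `fetch` so the example can run against a fake instead of the network.

```python
import time


def crawl_years(years, fetch, pause_seconds=1.0):
    """Fetch one page per year, sleeping between consecutive requests.

    `fetch` is any callable mapping a URL to raw html, for example
    `lambda url: requests.get(url).text`.
    """
    pages = {}
    for i, year in enumerate(years):
        if i > 0:
            # Be polite: wait before every request after the first.
            time.sleep(pause_seconds)
        # Assumed pattern, copied from the 2015 page used above.
        url = f'https://en.wikipedia.org/wiki/List_of_American_films_of_{year}'
        pages[year] = fetch(url)
    return pages


# Offline demonstration with a fake fetcher that records what was asked for.
requested = []


def fake_fetch(url):
    requested.append(url)
    return f'<html>{url}</html>'


pages = crawl_years([2014, 2015], fake_fetch, pause_seconds=0)
print(sorted(pages))
print(requested)
```

Each returned page can then be soupified and its `wikitable` tables passed to `parse_table`.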


## 3. Next steps

There are a few odds and ends that ended up in the final script for this portion of the project not mentioned here.


### 3.1. Cells with too much information

The `text` attribute in BeautifulSoup is perhaps too aggressive in stripping tags from text fields, leaving us with the concatenated "Cast" names above. A more cautious approach would be to leave a marker where tags were stripped, and deal with the markers later. This may be more practicable than handling each column separately, which would assume too much structure from table to table on Wikipedia. For example, in the `List_of_American_films_of_2015` table the `<br />` tags separate actors in the "Cast" column, but spread a note over multiple lines in the "Notes" column.

There is a `stripped_strings` attribute in BeautifulSoup which returns a generator over the text pieces, which we can then, for example, join with whatever separator we want.

In [10]:
rows = table.find_all('tr')
row4 = rows[3]

In [11]:
cast_column = row4.find_all('td')[2]
list(cast_column.stripped_strings)

['Scott Foley', 'Patrick Wilson', 'Donald Faison', 'James Carpinello']

In [12]:
'|'.join(cast_column.stripped_strings)

'Scott Foley|Patrick Wilson|Donald Faison|James Carpinello'
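
The "leave a marker, deal with it later" idea can also be sketched with `get_text`, which accepts a `separator` and a `strip` flag. The cell below is a hypothetical one modeled on the "Cast" cells shown earlier, and the choice of `|` as marker assumes that character never appears in the cell text itself.

```python
from bs4 import BeautifulSoup

# Hypothetical cell, modeled on the "Cast" cells shown earlier.
cell_html = (
    '<td><a href="/wiki/Liam_Neeson">Liam Neeson</a>'
    '<p><a href="/wiki/Forest_Whitaker">Forest Whitaker</a></p>'
    '<p><a href="/wiki/Famke_Janssen">Famke Janssen</a></p></td>'
)
cell = BeautifulSoup(cell_html, 'html.parser').td

SEP = '|'  # assumed not to occur in the cell text itself

# Keep a marker wherever a tag boundary separated two strings.
marked = cell.get_text(separator=SEP, strip=True)
print(marked)

# Later, per column, decide what the marker means. For "Cast" it splits names.
cast = marked.split(SEP)
print(cast)
```

For a "Notes" cell the same marker could instead be replaced by a space, so the per-column decision is deferred until after the table has been parsed.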

### 3.2. Crawling a full page

Each page typically has multiple tables. Unfortunately, even on a single page not all the tables have the same column header names. This causes Pandas to add extra columns when we append the data frames from a single page's tables together. Specifically, the 1962 page has a "Notes/Studio" column in the first table which is renamed to just "Notes" in all subsequent tables. The solution we went with was to take the columns from the first table on a page as master, and force all subsequent tables to have the same column names. Another option would be to just let Pandas do its thing, and deal with this later.
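
One way the "first table is master" rule might look is sketched below. It is not the final script: `combine_tables` is a hypothetical helper, the two tiny frames are made-up stand-ins for parsed tables, and the sketch assumes that tables with the same number of columns list them in the same order.

```python
import pandas as pd


def combine_tables(frames):
    """Concatenate tables, forcing the first table's column names on all.

    Assumes every table lists its columns in the same order; tables with a
    different number of columns are skipped rather than guessed at.
    """
    master = list(frames[0].columns)
    aligned = []
    for frame in frames:
        if len(frame.columns) != len(master):
            continue  # shape mismatch: leave for manual inspection
        renamed = frame.copy()
        renamed.columns = master
        aligned.append(renamed)
    return pd.concat(aligned, ignore_index=True)


# Hypothetical stand-ins for two parsed tables from one page.
first = pd.DataFrame([['Film A', 'Studio X']], columns=['Title', 'Notes/Studio'])
second = pd.DataFrame([['Film B', 'Studio Y']], columns=['Title', 'Notes'])

page = combine_tables([first, second])
print(page)

# For comparison, letting Pandas align by name yields three columns.
naive = pd.concat([first, second], ignore_index=True)
print(list(naive.columns))
```

Renaming by position is blunt: it silently mislabels a table whose columns are in a different order, which is why the sketch at least refuses tables of a different width.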