# Wikipedia Post-Process

By crawling Wikipedia we obtained a set of tables with similar, but not identical, schemas. We want to map everything onto a single master schema so that we can work with the mined data safely and consistently.
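
As a rough sketch of what mapping onto a master schema could look like, the snippet below renames column variants to canonical names and fills missing columns with NaN. The names `COLUMN_ALIASES`, `MASTER_COLUMNS` and `to_master_schema` are hypothetical, and the two toy tables stand in for the crawled data; the actual master schema may differ.

```python
import pandas as pd

# Hypothetical mapping from normalized column variants to master-schema names.
COLUMN_ALIASES = {
    'note': 'notes',
    'castandcrew': 'cast',
}
# Hypothetical master schema; the real one may include more columns.
MASTER_COLUMNS = ['year', 'title', 'director', 'cast', 'notes']


def to_master_schema(df, year):
    """Return a copy of df with normalized, renamed columns in master order."""
    out = df.copy()
    # Lower-case and strip spaces so 'Cast and crew' becomes 'castandcrew'.
    out.columns = [c.lower().replace(' ', '') for c in out.columns]
    out = out.rename(columns=COLUMN_ALIASES)
    out['year'] = year
    # reindex adds any missing master column as NaN and drops the extras.
    return out.reindex(columns=MASTER_COLUMNS)


# Toy tables standing in for two crawled years.
t1 = pd.DataFrame({'Title': ['A'], 'Director': ['X'], 'Cast': ['P'], 'Note': ['n1']})
t2 = pd.DataFrame({'Title': ['B'], 'Cast and crew': ['Q'], 'Notes': ['n2']})

master = pd.concat(
    [to_master_schema(t1, 1950), to_master_schema(t2, 1951)],
    ignore_index=True,
)
print(master)
```

Using `reindex` keeps every table in the same column order, at the cost of silently dropping columns that are not listed in the master schema.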

In [1]:
import os
import glob
import re

import pandas as pd

from dotenv import load_dotenv, find_dotenv
#find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()
# load up the entries as environment variables
load_dotenv(dotenv_path)

True

In [2]:
data_dir = '../data/raw/wikipedia-movies-20181021/'

files = glob.glob(data_dir + 'movies-*.csv')
# 118 files total
tables = dict((f, pd.read_csv(f)) for f in files)
pattern = r'movies-(\d{4})\.csv'
# Want to reference year
tables = dict(
    (int(re.findall(pattern, k)[0]), v) 
    for k, v in tables.items()
)
# There are some columns that differ only in having caps vs no caps
# or missing a white space another has
for year, t in tables.items():
    t['year'] = year    
    t.columns = [c.lower().replace(' ', '') for c in t.columns]

In [3]:
# How much data do we have?
count = sum(len(t) for t in tables.values())
print(count)

29004


In [4]:
# Let's look at what we have
columns = set(c.lower() for t in tables.values() for c in t.columns)
metadata = pd.DataFrame(index=tables.keys())
for col in columns:
    for year in metadata.index:
        table_cols = [c.lower() for c in tables[year].columns]
        metadata.loc[year, col] = (col in table_cols)

summary = metadata.sum()
summary = summary.reset_index()
summary.columns = ['column_name', 'count']
summary

    column_name  count
0      director    117
1   castandcrew      1
2   releasedate      2
3          ref.      4
4          note     34
5         title    118
6          cast     90
7         notes     72
8  studio/notes      1
9        studio      1


In [5]:
def count_key(key):
    n = summary.loc[
        summary['column_name'].str.contains(key), 
        'count'
    ].sum()
    print(f'total tables with {key}:', n)

count_key('note')
count_key('genre')
count_key('cast')

total tables with note: 117
total tables with genre: 118
total tables with cast: 118


All but one table has a note column (probably the one with an 'unnamed:4' column).
All tables have a genre column, and all tables have a cast column.
We have 6 tables that have release date info beyond just the year.
We will already get all of this from the OMDb API, but for fun let's be completists and get as much info as we can from these tables.
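
To pin down which table lacks a note-like column instead of guessing, one could list the years whose columns never contain a given key. This is a minimal sketch: `years_missing` is a hypothetical helper, the toy `tables` dict stands in for the real one, and it uses plain substring matching rather than the regex matching of `str.contains`.

```python
import pandas as pd

# Toy stand-in for the `tables` dict (year -> DataFrame with normalized columns).
tables = {
    1950: pd.DataFrame(columns=['title', 'director', 'cast', 'note']),
    1951: pd.DataFrame(columns=['title', 'director', 'cast', 'unnamed:4']),
    1952: pd.DataFrame(columns=['title', 'cast', 'studio/notes']),
}


def years_missing(tables, key):
    """Return the sorted years whose table has no column name containing key."""
    return sorted(
        year for year, t in tables.items()
        if not any(key in col for col in t.columns)
    )


missing_notes = years_missing(tables, 'note')
print(missing_notes)
```

Run against the real `tables`, the result for `'note'` should be the single year behind the 117 count, which can then be inspected by hand.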

In [6]:
film_list = pd.concat([df[['year', 'title']] for df in tables.values()])
# Did we get everyone?
len(film_list)
# Yes.

29004

In [7]:
# The basic tier on the OMDb Patreon page gives 100K requests a day for a $1 per month donation.
# For $5 per month you get 250K requests and access to the poster API.
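
A quick sanity check on the request budget, plus a sketch of how a lookup URL for one film might be built. It assumes the OMDb endpoint `http://www.omdbapi.com/` with the `apikey`, `t` (title) and `y` (year) query parameters; the environment variable name `OMDB_API_KEY` and the helper names are hypothetical. No request is actually sent.

```python
import math
import os
from urllib.parse import urlencode

# Assumed OMDb endpoint; check the OMDb documentation before relying on it.
OMDB_URL = 'http://www.omdbapi.com/'


def omdb_query_url(title, year, api_key):
    """Build a title-and-year lookup URL (no request is made here)."""
    params = {'apikey': api_key, 't': title, 'y': year}
    return OMDB_URL + '?' + urlencode(params)


def days_needed(n_requests, daily_limit):
    """Number of days needed to send n_requests under a daily quota."""
    return math.ceil(n_requests / daily_limit)


# OMDB_API_KEY is a hypothetical variable name, e.g. loaded from the .env file.
api_key = os.environ.get('OMDB_API_KEY', 'demo-key')
url = omdb_query_url('Casablanca', 1942, api_key)
# One request per film, using the film count and basic-tier quota from above.
days = days_needed(29004, 100_000)
print(url)
print(days)
```

With one request per film, the 29004 titles fit within a single day of the basic tier.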