## Quiz

In [1]:
import pandas as pd
import wptools
import os
import requests
from PIL import Image
from io import BytesIO

In [2]:
title_list = [
 'The_Wizard_of_Oz_(1939_film)',
 'Citizen_Kane',
 'Get_Out_(film)',
 'Mad_Max:_Fury_Road',
 'Inside_Out_(2015_film)',
 'The_Godfather',
 'Metropolis_(1927_film)',
 'E.T._the_Extra-Terrestrial',
 'Casablanca_(film)',
 'Moonlight_(2016_film)',
 'Nosferatu',
 'Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 "A_Hard_Day%27s_Night_(film)",
 'The_Battle_of_Algiers',
 'Dunkirk_(2017_film)',
 'Repulsion_(film)',
 '12_Years_a_Slave_(film)',
 'Gravity_(2013_film)',
 'King_Kong_(1933_film)',
 'Spotlight_(film)',
 'Rear_Window',
 'Selma_(film)',
 'Taxi_Driver',
 'Toy_Story_3',
 'Argo_(2012_film)',
 'Toy_Story_2',
 'The_Big_Sick',
 'Zootopia',
 'M_(1931_film)',
 'Wonder_Woman_(2017_film)',
 '12_Angry_Men_(1957_film)',
 'The_400_Blows',
 'Logan_(film)',
 'All_Quiet_on_the_Western_Front_(1930_film)',
 'Arrival_(film)',
 'Baby_Driver',
 'Star_Wars:_The_Force_Awakens',
 'Manchester_by_the_Sea_(film)',
 'Dr._Strangelove',
 'The_Dark_Knight_(film)',
 'The_Babadook',
 'The_Conformist_(film)',
 'Rebecca_(1940_film)',
 "Rosemary%27s_Baby_(film)",
 'Finding_Nemo',
 'Brooklyn_(film)',
 'The_Wrestler_(2008_film)',
 'The_39_Steps_(1935_film)',
 'L.A._Confidential_(film)',
 'Gone_with_the_Wind_(film)',
 'The_Good,_the_Bad_and_the_Ugly',
 'Skyfall',
 'Rome,_Open_City',
 'Tokyo_Story',
 'Hell_or_High_Water_(film)',
 'Pinocchio_(1940_film)',
 'The_Jungle_Book_(2016_film)',
 'La_La_Land_(film)',
 'Star_Trek_(film)',
 'High_Noon',
 'Apocalypse_Now',
 'On_the_Waterfront',
 'The_Wages_of_Fear',
 'The_Last_Picture_Show',
 'Harry_Potter_and_the_Deathly_Hallows_–_Part_2',
 'The_Grapes_of_Wrath_(film)',
 'Roman_Holiday',
 'Man_on_Wire',
 'Jaws_(film)',
 'Toy_Story',
 'The_Godfather_Part_II',
 'Battleship_Potemkin'
]

In [3]:
folder_name = 'bestofrt_posters'
# Make directory if it doesn't already exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
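
As an alternative to the explicit existence check, `os.makedirs` accepts `exist_ok=True`, which makes the call safe to repeat. The sketch below runs inside a temporary directory (an assumption made only for the demo) so that it does not touch the real `bestofrt_posters` folder.

```python
import os
import tempfile

# Work inside a throwaway directory so the demo leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    folder_name = os.path.join(tmp_dir, 'bestofrt_posters')

    # exist_ok=True: no error is raised if the folder already exists
    os.makedirs(folder_name, exist_ok=True)
    os.makedirs(folder_name, exist_ok=True)

    folder_created = os.path.isdir(folder_name)
```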

The Jupyter Notebook below contains template code that:

- Contains `title_list`, which is a list of all of the Wikipedia page titles for each movie in the Rotten Tomatoes Top 100 Movies of All Time list.


- Creates an empty list, `df_list`, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame.


- Creates an empty folder, `bestofrt_posters`, to store the downloaded movie poster image files.


- Creates an empty dictionary, `image_errors`, to keep track of movie poster image URLs that don't work.


- Loops through the Wikipedia page titles in `title_list` and:

    - Stores the ranking of each title based on its position in `title_list`. The ranking is needed so that this table can be joined with the master DataFrame later. Joining on title would not work because the Rotten Tomatoes titles and the Wikipedia page titles differ.

    - Uses `try` and `except` blocks to attempt to query MediaWiki for a movie poster image URL and to download that image. If an error is encountered, the offending movie is recorded in `image_errors`.

    - Appends a dictionary with `ranking`, `title`, and `poster_url` as the keys and the extracted values for each as the values to `df_list`.


- Creates a DataFrame called `df` by converting `df_list` using the `pd.DataFrame` constructor.
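
The loop described above can be sketched without any network access. In this minimal sketch, `collect_posters`, `fetch_poster_url`, and `fake_fetch` are hypothetical names: `fetch_poster_url` stands in for the wptools and requests calls, and the error dictionary stores the exception message rather than the image data that the quiz cell stores.

```python
def collect_posters(titles, fetch_poster_url):
    """Return (rows, errors) for a list of page titles.

    fetch_poster_url is a hypothetical callable standing in for the
    wptools/requests calls; it returns a URL or raises an exception.
    """
    rows = []
    errors = {}
    # enumerate(..., start=1) yields the 1-based ranking without a list search
    for ranking, title in enumerate(titles, start=1):
        try:
            poster_url = fetch_poster_url(title)
            rows.append({'ranking': ranking,
                         'title': title,
                         'poster_url': poster_url})
        except Exception as e:
            # Key format mirrors the "<ranking>_<title>" keys used in this quiz
            errors[str(ranking) + "_" + title] = str(e)
    return rows, errors


def fake_fetch(title):
    # Stand-in fetcher: fails for one title to exercise the error path
    if title == 'Bad_Title':
        raise ValueError('no poster found')
    return 'https://example.org/' + title + '.jpg'


rows, errors = collect_posters(['Citizen_Kane', 'Bad_Title', 'Jaws_(film)'], fake_fetch)
print(rows)
print(errors)
```

Because the ranking comes from the position in the list, a failed title leaves a gap in the rankings (here, 2 is missing from `rows`) rather than shifting the later ones.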

#### Note: the cell below, if correctly implemented, will likely take ~5 minutes to run.

In [4]:
# List of dictionaries to build and convert to a DataFrame later
df_list = []
image_errors = {}
for ranking, title in enumerate(title_list, start=1):
    # Reset so a failure before the image lookup doesn't record the previous movie's images
    images = None
    try:
        # This cell is slow so print ranking to gauge time remaining
        print(ranking)
        
        # Create a wptools page object for this title
        page = wptools.page(title, silent=True)
        
        # Get the image
        images = page.get().data['image']
        
        # First image is usually the poster
        first_image_url = images[0]['url']
        r = requests.get(first_image_url)
        
        # Download movie poster image
        i = Image.open(BytesIO(r.content))
        image_file_format = first_image_url.split('.')[-1]
        i.save(folder_name + "/" + str(ranking) + "_" + title + '.' + image_file_format)
        
        # Append to list of dictionaries
        df_list.append({'ranking': int(ranking),
                        'title': title,
                        'poster_url': first_image_url})
    
    # Not best practice to catch all exceptions but fine for this short script
    except Exception as e:
        print(str(ranking) + "_" + title + ": " + str(e))
        image_errors[str(ranking) + "_" + title] = images

1
2
3
4
5
6
7
8
9
10
11
12
13


API error: {'code': 'invalidtitle', 'info': 'Bad title "A_Hard_Day%27s_Night_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'}


13_A_Hard_Day%27s_Night_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=A_Hard_Day%2527s_Night_%28film%29
14
14_The_Battle_of_Algiers: cannot identify image file <_io.BytesIO object at 0x7f90818d1ca8>
15
16
17
18
19
20
21
21_Rear_Window: cannot identify image file <_io.BytesIO object at 0x7f90814d8d00>
22
23
24
25
26
27
28
29
30
31
31_12_Angry_Men_(1957_film): cannot identify image file <_io.BytesIO object at 0x7f90815edb48>
32
32_The_400_Blows: cannot identify image file <_io.BytesIO object at 0x7f908159cbf8>
33
34
34_All_Quiet_on_the_Western_Front_(1930_film): cannot identify image file <_io.BytesIO object at 0x7f908159cbf8>
35
36
37
38
39
40
41
42
42_The_Conformist_(film): cannot identify image file <_io.BytesIO object at 0x7f90815d7c50>
43
43_Rebecca_(1940_film): cannot identify image file <_io.BytesIO object at 0x

API error: {'code': 'invalidtitle', 'info': 'Bad title "Rosemary%27s_Baby_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'}


44_Rosemary%27s_Baby_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=Rosemary%2527s_Baby_%28film%29
45
46
47
48
48_The_39_Steps_(1935_film): cannot identify image file <_io.BytesIO object at 0x7f90815d7c50>
49
50
50_Gone_with_the_Wind_(film): cannot identify image file <_io.BytesIO object at 0x7f908115f728>
51
52
53
53_Rome,_Open_City: cannot identify image file <_io.BytesIO object at 0x7f90818d1db0>
54
54_Tokyo_Story: cannot identify image file <_io.BytesIO object at 0x7f908190bb48>
55
56
57
58
59
60
60_High_Noon: cannot identify image file <_io.BytesIO object at 0x7f908116eaf0>
61
62
62_On_the_Waterfront: cannot identify image file <_io.BytesIO object at 0x7f908115fe08>
63
64
65
66
66_The_Grapes_of_Wrath_(film): cannot identify image file <_io.BytesIO object at 0x7f90818d14c0>
67
67_Roman_Holiday: cannot identify ima
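
The two `invalidtitle` errors in the output above involve titles that contain `%27`, the percent-encoded form of an apostrophe. The request URLs in the output show `%2527`, which suggests the already-encoded title was encoded a second time. One possible fix, sketched below with the standard library's `urllib.parse.unquote`, is to decode the titles before passing them to the API. Whether the decoded titles then resolve on Wikipedia is not verified here.

```python
from urllib.parse import unquote

# Two titles from title_list that triggered "Bad title", plus one unaffected title
encoded_titles = ["A_Hard_Day%27s_Night_(film)",
                  "Rosemary%27s_Baby_(film)",
                  "Citizen_Kane"]

# unquote replaces %xx escapes with the characters they represent;
# titles without escapes pass through unchanged
decoded_titles = [unquote(t) for t in encoded_titles]
print(decoded_titles)
```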

Once you have completed the above code requirements, read and run the three cells below and interpret their output.

In [5]:
for key in image_errors.keys():
    print(key)

13_A_Hard_Day%27s_Night_(film)
14_The_Battle_of_Algiers
21_Rear_Window
31_12_Angry_Men_(1957_film)
32_The_400_Blows
34_All_Quiet_on_the_Western_Front_(1930_film)
42_The_Conformist_(film)
43_Rebecca_(1940_film)
44_Rosemary%27s_Baby_(film)
48_The_39_Steps_(1935_film)
50_Gone_with_the_Wind_(film)
53_Rome,_Open_City
54_Tokyo_Story
60_High_Noon
62_On_the_Waterfront
66_The_Grapes_of_Wrath_(film)
67_Roman_Holiday
68_Man_on_Wire
72_Battleship_Potemkin
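
Most of the entries above failed with "cannot identify image file", which Pillow raises when the bytes it receives are not in a format it recognizes. One possible explanation (an assumption, not confirmed by the output) is that the server returned something other than the image, such as an HTML error page. A quick check is to look at the first bytes of the response. The helper `sniff_image_format` below is hypothetical and covers only a few common formats.

```python
def sniff_image_format(data):
    """Guess an image format from leading magic bytes; None if unrecognized."""
    signatures = {
        b'\xff\xd8\xff': 'jpeg',
        b'\x89PNG\r\n\x1a\n': 'png',
        b'GIF87a': 'gif',
        b'GIF89a': 'gif',
    }
    for magic, name in signatures.items():
        if data.startswith(magic):
            return name
    return None


# Synthetic payloads for illustration: a PNG header and an HTML error page
png_bytes = b'\x89PNG\r\n\x1a\n' + b'\x00' * 8
html_bytes = b'<!DOCTYPE html><html><body>Forbidden</body></html>'
print(sniff_image_format(png_bytes))
print(sniff_image_format(html_bytes))
```

In the download loop, applying such a check to `r.content`, or inspecting `r.status_code`, before calling `Image.open` would help distinguish a failed download from a genuinely unreadable image.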


In [6]:
# Create DataFrame from list of dictionaries
df = pd.DataFrame(df_list, columns = ['ranking', 'title', 'poster_url'])
df = df.sort_values('ranking').reset_index(drop=True)
df

| | ranking | title | poster_url |
|---|---|---|---|
| 0 | 1 | The_Wizard_of_Oz_(1939_film) | https://upload.wikimedia.org/wikipedia/commons... |
| 1 | 2 | Citizen_Kane | https://upload.wikimedia.org/wikipedia/commons... |
| 2 | 3 | Get_Out_(film) | https://upload.wikimedia.org/wikipedia/en/a/a3... |
| 3 | 4 | Mad_Max:_Fury_Road | https://upload.wikimedia.org/wikipedia/en/6/6e... |
| 4 | 5 | Inside_Out_(2015_film) | https://upload.wikimedia.org/wikipedia/en/0/0a... |
| 5 | 6 | The_Godfather | https://upload.wikimedia.org/wikipedia/en/1/1c... |
| 6 | 7 | Metropolis_(1927_film) | https://upload.wikimedia.org/wikipedia/en/9/97... |
| 7 | 8 | E.T._the_Extra-Terrestrial | https://upload.wikimedia.org/wikipedia/en/6/66... |
| 8 | 9 | Casablanca_(film) | https://upload.wikimedia.org/wikipedia/commons... |
| 9 | 10 | Moonlight_(2016_film) | https://upload.wikimedia.org/wikipedia/en/8/84... |
