## Gathering Data

### Source: Files on Hand
Udacity compiled a file with the Top 100 movies, based on [this list](https://www.rottentomatoes.com/top/bestofrt/) from Rotten Tomatoes.

#### Flat File Structure
Flat files contain tabular data in plain text format with one data record per line and each record or line having one or more fields. These fields are separated by delimiters, like commas, tabs, or colons.
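
As a minimal sketch of that structure, the snippet below parses a small tab-delimited sample with Python's built-in `csv` module. The sample text is made up for illustration (it mirrors the first rows of the Top 100 file) and is not read from the actual file.

```python
import csv
import io

# Illustrative sample: one record per line, fields separated by tabs
sample = (
    "ranking\tcritic_score\ttitle\n"
    "1\t99\tThe Wizard of Oz (1939)\n"
    "2\t100\tCitizen Kane (1941)\n"
)

# DictReader uses the first line as field names and splits each record on the delimiter
records = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

for record in records:
    print(record["ranking"], record["title"])
```

Every field comes back as a string, which is one reason a type-aware reader is usually more convenient for analysis.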

**Advantages of flat files** include:

- They're text files and therefore human readable
- Lightweight
- Simple to understand
- Software that can read/write text files is ubiquitous, like text editors
- Great for small datasets

**Disadvantages of flat files**, in comparison to relational databases, for example, include:

- Lack of standards
- Data redundancy
- Sharing data can be cumbersome
- Not great for large datasets

#### Flat Files in Python
Pandas is especially suited to read tabular data.

[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) can read most flat files; for TSV files, set the `sep` parameter to `'\t'`. Try importing the .tsv file based on the Top 100 Movies from Rotten Tomatoes.

In [1]:
import pandas as pd
import numpy as np
import os
import requests
import glob
import wptools
from bs4 import BeautifulSoup
from PIL import Image
from io import BytesIO
from sqlalchemy import create_engine

In [2]:
df = pd.read_csv('support-files/02_Gathering-Data/bestofrt.tsv', sep='\t')
df.head()

| | ranking | critic_score | title | number_of_critic_ratings |
|---|---|---|---|---|
| 0 | 1 | 99 | The Wizard of Oz (1939) | 110 |
| 1 | 2 | 100 | Citizen Kane (1941) | 75 |
| 2 | 3 | 100 | The Third Man (1949) | 77 |
| 3 | 4 | 99 | Get Out (2017) | 282 |
| 4 | 5 | 97 | Mad Max: Fury Road (2015) | 370 |


### Source: Web Scraping
Next, we'd like to add Rotten Tomatoes' audience scores and the number of audience reviews to our dataset. This data isn't easily accessible from the website, so we'll need web scraping, which extracts data from websites using code.

In order to do that, we'll use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).

Read [this](https://medium.com/towards-data-science/ethics-in-web-scraping-b96b18136f01) article on the ethical issues involved in web scraping.

In [3]:
from bs4 import BeautifulSoup

In [4]:
with open('support-files/02_Gathering-Data/rt_html/et_the_extraterrestrial.html') as file:
    soup = BeautifulSoup(file, "lxml")

Note: `.find` [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find)

#### Getting the `movie-title`

In [5]:
# find the title of the web page (not the title of the movie exactly)
soup.find('title')

<title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes</title>

Note: a tag's children are available in a list called [.contents](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find)

In [6]:
# the title tag's only child is its text, so indexing with .contents[0]
# returns the whole string and can't remove ' - Rotten Tomatoes' on its own
soup.find('title').contents[0]

'E.T. The Extra-Terrestrial\xa0(1982) - Rotten Tomatoes'

In [7]:
# we can, however, use string slicing
title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
title

'E.T. The Extra-Terrestrial\xa0(1982)'

Note: `\xa0` is the Unicode non-breaking space (U+00A0). This [discussion](https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python) on Stack Overflow shows how to remove it.

In [8]:
# import unicodedata
# title = unicodedata.normalize('NFKD', title)
# title
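
Here is a small, self-contained sketch of what the commented-out cell above would do, run on the title string from the earlier output. It also shows a plain `str.replace` as an alternative; which one fits best is a judgment call.

```python
import unicodedata

# Title string as returned by the slicing step, including the non-breaking space
raw_title = 'E.T. The Extra-Terrestrial\xa0(1982)'

# NFKD applies compatibility decomposition, which maps U+00A0 to a regular space
clean_title = unicodedata.normalize('NFKD', raw_title)

# A narrower alternative that touches only the non-breaking space
replaced_title = raw_title.replace('\xa0', ' ')

print(repr(clean_title))
```

Note that NFKD also decomposes accented characters into a base letter plus a combining mark, so the `replace` variant may be the safer choice when titles contain accents.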

### Quiz

You're going to use Beautiful Soup to extract our desired **Audience Score** metric and **number of audience ratings**, along with the **movie title** (so we have something to merge the datasets on later) for each HTML file, then save them in a pandas DataFrame.

In [9]:
# displaying the whole "soup" is nice to investigate the html page,
# however, it's not visually appealing when uploading to GitHub
# so I'll comment out this cell
# soup

#### First: Get `audience-score`

Note: Searching by [CSS class](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class)

In [10]:
# you can search by CSS class in BeautifulSoup; since class is a reserved word in Python,
# the keyword argument for class is "class_"
soup.find_all('div', class_='meter-value')

[<div class="meter-value">
 <span class="superPageFontColor" style="vertical-align:top">72%</span>
 </div>]

In [11]:
# use .contents to get only the line I'm interested in
soup.find('div', class_='meter-value').contents[1]

<span class="superPageFontColor" style="vertical-align:top">72%</span>

Note: `get_text()` [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text)

In [12]:
# use get_text() to get ONLY the audience rating
# get_text() will return only the human-readable text inside a document or tag
soup.find('div', class_='meter-value').contents[1].get_text()

'72%'

In [13]:
# remove the % so we can convert it to an int later
audience_score = soup.find('div', class_='meter-value').contents[1].get_text()[:-len('%')]
audience_score

'72'

#### Second: Get `number of audience ratings`

In [14]:
# get the whole div
soup.find_all('div', class_='audience-info hidden-xs superPageFontColor')

[<div class="audience-info hidden-xs superPageFontColor">
 <div>
 <span class="subtle superPageFontColor">Average Rating:</span>
             3.5/5
                 </div>
 <div>
 <span class="subtle superPageFontColor">User Ratings:</span>
         32,313,030</div>
 </div>]

In [15]:
# filter by the line I need
soup.find('div', class_='audience-info hidden-xs superPageFontColor').contents[3]

<div>
<span class="subtle superPageFontColor">User Ratings:</span>
        32,313,030</div>

In [16]:
# filter some more using get_text()
soup.find('div', class_='audience-info hidden-xs superPageFontColor').contents[3].get_text()

'\nUser Ratings:\n        32,313,030'

In [17]:
# slice to get only the number of audience ratings
num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor').contents[3].get_text()[len('\nUser Ratings:\n        '):]
num_audience_ratings

'32,313,030'

In [18]:
# remove the commas so we can convert it to an int later
num_audience_ratings = num_audience_ratings.replace(',', '')
num_audience_ratings

'32313030'
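
The slicing above depends on the exact prefix and whitespace in the page. As a hedged alternative, the sketch below pulls the number out with a regular expression instead. `first_int` is a hypothetical helper written for this note, and the input strings are copied from the outputs above.

```python
import re

# Strings copied from the outputs above
score_text = '72%'
ratings_text = '\nUser Ratings:\n        32,313,030'

def first_int(text):
    """Return the first run of digits (commas allowed) in text as an int."""
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        raise ValueError(f'no number found in {text!r}')
    return int(match.group().replace(',', ''))

audience_score = first_int(score_text)
num_audience_ratings = first_int(ratings_text)
print(audience_score, num_audience_ratings)
```

This trades the fixed-length slice for a pattern match, so extra spaces or a changed label would not shift the result.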

#### Third: Get the `movie title`

In [19]:
soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]

'E.T. The Extra-Terrestrial\xa0(1982)'

#### Next: create a loop to extract this information for all 100 files

In [20]:
import os

In [21]:
# list of dictionaries to build file by file and later convert to a DataFrame
df_list = []
folder = 'support-files/02_Gathering-Data/rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, "lxml")
        
        # title
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        #title = unicodedata.normalize('NFKD', title)
        
        # audience score
        audience_score = soup.find('div', class_='meter-value').contents[1].get_text()[:-len('%')]
        
        # number of audience ratings
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor').contents[3].get_text()[len('\nUser Ratings:\n        '):]
        num_audience_ratings = num_audience_ratings.replace(',', '')
        
        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})
df = pd.DataFrame(df_list, columns = ['title', 'audience_score', 'number_of_audience_ratings'])

In [22]:
df.head()

| | title | audience_score | number_of_audience_ratings |
|---|---|---|---|
| 0 | 12 Angry Men (Twelve Angry Men) (1957) | 97 | 103672 |
| 1 | The 39 Steps (1935) | 86 | 23647 |
| 2 | The Adventures of Robin Hood (1938) | 89 | 33584 |
| 3 | All About Eve (1950) | 94 | 44564 |
| 4 | All Quiet on the Western Front (1930) | 89 | 17768 |


#### Flashforward 1
Once this newly created DataFrame containing the audience scores and number of audience ratings is joined with the .tsv file containing information on critic ratings, a visualization like [this](https://public.tableau.com/app/profile/david.venturi/viz/BestofRottenTomatoesCriticvs_AudienceScores/BestofRottenTomatoesCriticvs_AudienceScores) could be created! I'll get there soon enough :), as soon as I get to the assessing and cleaning lessons.

### Source: Downloading Files from the Internet
To start building the Roger Ebert review word cloud, we first download each review as a text file.

In [23]:
import requests
import os

In [24]:
folder_name = 'support-files/02_Gathering-Data/ebert_reviews'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
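
As a side note, `os.makedirs` accepts an `exist_ok` flag that folds the existence check into the call. The sketch below uses a temporary directory (an assumption made so it doesn't touch the project folders) to show that the call is safe to repeat.

```python
import os
import tempfile

# Use a throwaway base directory so the sketch does not touch the project folders
base = tempfile.mkdtemp()
demo_folder = os.path.join(base, 'ebert_reviews')

# exist_ok=True makes the call safe to repeat: no error if the folder already exists
os.makedirs(demo_folder, exist_ok=True)
os.makedirs(demo_folder, exist_ok=True)

print(os.path.isdir(demo_folder))
```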

In [25]:
ebert_review_urls = ['https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9900_1-the-wizard-of-oz-1939-film/1-the-wizard-of-oz-1939-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_2-citizen-kane/2-citizen-kane.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_3-the-third-man/3-the-third-man.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_4-get-out-film/4-get-out-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_5-mad-max-fury-road/5-mad-max-fury-road.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_6-the-cabinet-of-dr.-caligari/6-the-cabinet-of-dr.-caligari.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_7-all-about-eve/7-all-about-eve.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_8-inside-out-2015-film/8-inside-out-2015-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_9-the-godfather/9-the-godfather.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_10-metropolis-1927-film/10-metropolis-1927-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_12-modern-times-film/12-modern-times-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_14-singin-in-the-rain/14-singin-in-the-rain.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_15-boyhood-film/15-boyhood-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_16-casablanca-film/16-casablanca-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_17-moonlight-2016-film/17-moonlight-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_18-psycho-1960-film/18-psycho-1960-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_19-laura-1944-film/19-laura-1944-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_20-nosferatu/20-nosferatu.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_21-snow-white-and-the-seven-dwarfs-1937-film/21-snow-white-and-the-seven-dwarfs-1937-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_22-a-hard-day27s-night-film/22-a-hard-day27s-night-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_23-la-grande-illusion/23-la-grande-illusion.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_25-the-battle-of-algiers/25-the-battle-of-algiers.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_26-dunkirk-2017-film/26-dunkirk-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_27-the-maltese-falcon-1941-film/27-the-maltese-falcon-1941-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_29-12-years-a-slave-film/29-12-years-a-slave-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_30-gravity-2013-film/30-gravity-2013-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_31-sunset-boulevard-film/31-sunset-boulevard-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_32-king-kong-1933-film/32-king-kong-1933-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_33-spotlight-film/33-spotlight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_34-the-adventures-of-robin-hood/34-the-adventures-of-robin-hood.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_35-rashomon/35-rashomon.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_36-rear-window/36-rear-window.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_37-selma-film/37-selma-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_38-taxi-driver/38-taxi-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_39-toy-story-3/39-toy-story-3.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_40-argo-2012-film/40-argo-2012-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_41-toy-story-2/41-toy-story-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_42-the-big-sick/42-the-big-sick.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_43-bride-of-frankenstein/43-bride-of-frankenstein.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_44-zootopia/44-zootopia.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_45-m-1931-film/45-m-1931-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_46-wonder-woman-2017-film/46-wonder-woman-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_48-alien-film/48-alien-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_49-bicycle-thieves/49-bicycle-thieves.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_50-seven-samurai/50-seven-samurai.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_51-the-treasure-of-the-sierra-madre-film/51-the-treasure-of-the-sierra-madre-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_52-up-2009-film/52-up-2009-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_53-12-angry-men-1957-film/53-12-angry-men-1957-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_54-the-400-blows/54-the-400-blows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_55-logan-film/55-logan-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_57-army-of-shadows/57-army-of-shadows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_58-arrival-film/58-arrival-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_59-baby-driver/59-baby-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_60-a-streetcar-named-desire-1951-film/60-a-streetcar-named-desire-1951-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_61-the-night-of-the-hunter-film/61-the-night-of-the-hunter-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_62-star-wars-the-force-awakens/62-star-wars-the-force-awakens.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_63-manchester-by-the-sea-film/63-manchester-by-the-sea-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_64-dr.-strangelove/64-dr.-strangelove.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_66-vertigo-film/66-vertigo-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_67-the-dark-knight-film/67-the-dark-knight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_68-touch-of-evil/68-touch-of-evil.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_69-the-babadook/69-the-babadook.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_72-rosemary27s-baby-film/72-rosemary27s-baby-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_73-finding-nemo/73-finding-nemo.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_74-brooklyn-film/74-brooklyn-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_75-the-wrestler-2008-film/75-the-wrestler-2008-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_77-l.a.-confidential-film/77-l.a.-confidential-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_78-gone-with-the-wind-film/78-gone-with-the-wind-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_79-the-good-the-bad-and-the-ugly/79-the-good-the-bad-and-the-ugly.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_80-skyfall/80-skyfall.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_82-tokyo-story/82-tokyo-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_83-hell-or-high-water-film/83-hell-or-high-water-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_84-pinocchio-1940-film/84-pinocchio-1940-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_85-the-jungle-book-2016-film/85-the-jungle-book-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991a_86-la-la-land-film/86-la-la-land-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_87-star-trek-film/87-star-trek-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_89-apocalypse-now/89-apocalypse-now.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_90-on-the-waterfront/90-on-the-waterfront.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_91-the-wages-of-fear/91-the-wages-of-fear.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_92-the-last-picture-show/92-the-last-picture-show.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_93-harry-potter-and-the-deathly-hallows-part-2/93-harry-potter-and-the-deathly-hallows-part-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_94-the-grapes-of-wrath-film/94-the-grapes-of-wrath-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_96-man-on-wire/96-man-on-wire.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_97-jaws-film/97-jaws-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_98-toy-story/98-toy-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_99-the-godfather-part-ii/99-the-godfather-part-ii.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_100-battleship-potemkin/100-battleship-potemkin.txt']

Note1: See *warning* in the Requests [documentation](https://docs.python-requests.org/en/latest/user/quickstart/#make-a-request)

Note2: `open` [function's](https://docs.python.org/3/tutorial/inputoutput.html#tut-files) mode argument
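
The mode argument matters here because `response.content` is a `bytes` object. The sketch below stands in for a download with a made-up byte string (no network call) and shows the write in binary mode followed by a read in text mode.

```python
import os
import tempfile

# Stand-in for response.content: requests returns the body as bytes
content = 'Sample review text with “curly quotes”.\n'.encode('utf-8')

path = os.path.join(tempfile.mkdtemp(), 'review.txt')

# 'wb' writes the bytes unchanged; text mode ('w') would reject a bytes object
with open(path, mode='wb') as file:
    file.write(content)

# Reading back in text mode decodes the bytes with the stated encoding
with open(path, encoding='utf-8') as file:
    text = file.read()

print(text)
```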

In [26]:
# iterate through the url list
for url in ebert_review_urls:
    
    # create the request
    response = requests.get(url)
    
    # access the content and write to a file
    with open(os.path.join(folder_name,
                          url.split('/')[-1]), mode='wb') as file:
        file.write(response.content)

In [27]:
# there should be 88 files
len(os.listdir(folder_name))

88

12 movies in the top 100 Rotten Tomatoes list didn't have reviews on Roger Ebert's site.
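
Since every file name starts with the movie's ranking, one way to see which rankings are missing is to compare those prefixes against the full range. `missing_rankings` is a hypothetical helper, and the three file names are only an illustrative subset of the real folder.

```python
# Illustrative subset of file names; each starts with the movie's ranking
sample_files = [
    '1-the-wizard-of-oz-1939-film.txt',
    '2-citizen-kane.txt',
    '4-get-out-film.txt',
]

def missing_rankings(file_names, top_n):
    """Return the rankings from 1..top_n that have no matching file."""
    found = {int(name.split('-', 1)[0]) for name in file_names}
    return [rank for rank in range(1, top_n + 1) if rank not in found]

# With the real folder this would be missing_rankings(os.listdir(folder_name), 100)
print(missing_rankings(sample_files, 5))
```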

### Encodings and Character Sets Articles

- [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) by Joel Spolsky
- [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/)

### Text Files in Python
The [glob](https://docs.python.org/3/library/glob.html) library makes opening files with similar path structure (like our folder of Roger Ebert review text files) simple.
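
To make the pattern matching concrete, here is a small sketch that builds a throwaway folder with mixed file types and globs only the `.txt` files. The file names are made up for the example.

```python
import glob
import os
import tempfile

# Build a throwaway folder with mixed file types (names are made up)
folder = tempfile.mkdtemp()
for name in ['1-review.txt', '2-review.txt', 'notes.html']:
    with open(os.path.join(folder, name), 'w', encoding='utf-8') as file:
        file.write('placeholder\n')

# The * wildcard matches any run of characters, so only the .txt files are returned
txt_paths = sorted(glob.glob(os.path.join(folder, '*.txt')))

print([os.path.basename(path) for path in txt_paths])
```

`glob.glob` returns paths in arbitrary order, so the sketch sorts them to get a stable result.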

Whenever you open a file in Python, it's good practice to specify the encoding. The exact encoding depends on the source of the text. You can look at the page's HTML to figure out which encoding was used: it's in the `<meta charset>` tag. On Roger Ebert's website, it's `utf-8`.
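
To illustrate why the encoding matters, this sketch encodes a short string as UTF-8 and then decodes it two ways. The word used is an arbitrary example, not text from the reviews.

```python
# Encode text as UTF-8, the encoding used on Roger Ebert's site
original = 'café'
data = original.encode('utf-8')

# Decoding with the matching encoding round-trips cleanly
round_trip = data.decode('utf-8')

# Decoding with a different encoding silently produces garbled text
garbled = data.decode('latin-1')

print(round_trip, garbled)
```

No error is raised in the mismatched case, which is why a wrong encoding can go unnoticed until the text is inspected.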

In [28]:
# glob is especially useful if I have different file formats in this folder

df_list = []
for ebert_review in glob.glob('support-files/02_Gathering-Data/ebert_reviews/*.txt'):    
    with open(ebert_review, encoding='utf-8') as file:
        title = file.readline()[:-1]
        review_url = file.readline()[:-1]
        review_text = file.read()
        
        # append to list of dictionaries
        df_list.append({'title': title,
                       'review_url': review_url,
                       'review_text': review_text})

In [29]:
# create dataframe
df = pd.DataFrame(df_list, columns=['title', 'review_url', 'review_text'])

In [30]:
df.head()

| | title | review_url | review_text |
|---|---|---|---|
| 0 | The Wizard of Oz (1939) | http://www.rogerebert.com/reviews/great-movie-... | As a child I simply did not notice whether a m... |
| 1 | Metropolis (1927) | http://www.rogerebert.com/reviews/great-movie-... | The opening shots of the restored “Metropolis”... |
| 2 | Battleship Potemkin (1925) | http://www.rogerebert.com/reviews/great-movie-... | "The Battleship Potemkin” has been so famous f... |
| 3 | E.T. The Extra-Terrestrial (1982) | http://www.rogerebert.com/reviews/great-movie-... | Dear Raven and Emil:\n\nSunday we sat on the b... |
| 4 | Modern Times (1936) | http://www.rogerebert.com/reviews/modern-times... | A lot of movies are said to be timeless, but s... |
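
One detail of the reading loop: `readline()[:-1]` assumes each line ends with a newline. The sketch below, using made-up file contents in an in-memory `io.StringIO`, compares it with `rstrip('\n')`, which only removes a newline when one is there.

```python
import io

# Stand-in for a review file: title, URL, then the review text (contents are made up)
fake_file = io.StringIO('Example Movie (2000)\nhttp://example.com/review\nReview text.')

# rstrip('\n') removes a trailing newline if present and leaves the line intact otherwise
title = fake_file.readline().rstrip('\n')
review_url = fake_file.readline().rstrip('\n')
review_text = fake_file.read()

# By contrast, [:-1] always drops the last character, newline or not
last_line_only = io.StringIO('Example Movie (2000)')
sliced = last_line_only.readline()[:-1]

print(title, review_url, sliced, sep=' | ')
```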


#### More Information

1. Opening and reading files in Python:
- [Stack Overflow: Best Practices for Opening Files in Python](https://stackoverflow.com/a/22288895)
- [Stack Overflow: The Correct, Fully Pythonic Way to Read a File](https://stackoverflow.com/a/8010133)

2. Glob programming
- [Wikipedia: Glob programming](https://en.wikipedia.org/wiki/Glob_(programming))
- [Python glob Library](https://docs.python.org/3/library/glob.html)

### Source: APIs (Application Programming Interfaces)
We could scrape the poster image URLs from the HTML, but a better way is to access them through an API (Application Programming Interface). Each movie has its poster on its Wikipedia page, so we can use Wikipedia's API.

APIs give you relatively easy access to data from the Internet. Twitter, Facebook, and Instagram all have APIs, and there are many open-source APIs.

In this lesson we'll be using the API of [MediaWiki](https://www.mediawiki.org/wiki/MediaWiki), the popular open-source software that powers Wikipedia.

#### When Given a Choice, Pick API over Scraping
Scraping is brittle: when a website's layout is redesigned, the underlying HTML changes and the scraping code breaks.

#### MediaWiki API
MediaWiki has a great [tutorial](https://www.mediawiki.org/wiki/API:Tutorial) on their website on how their API calls are structured. It's a nice and simple example and they explain the various moving parts:

- The endpoint (important takeaway: there is nothing special about this URL!)
- The format
- The action
- Action-specific parameters
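
The four parts listed above can be sketched by assembling a request URL by hand. The parameter choice here (`prop=images` for the E.T. page) is only an illustrative assumption; the rest of this section uses wptools rather than raw URLs, and the sketch builds the URL without sending a request.

```python
from urllib.parse import urlencode

# English Wikipedia's MediaWiki API endpoint
endpoint = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',   # the action: query page data
    'format': 'json',    # the format of the response
    'prop': 'images',    # action-specific: list the files used on the page
    'titles': 'E.T._the_Extra-Terrestrial',  # action-specific: which page
}

url = endpoint + '?' + urlencode(params)
print(url)
```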

#### wptools Library
There are a bunch of different access libraries for MediaWiki to satisfy the variety of programming languages that exist. Here is a [list](https://www.mediawiki.org/wiki/API:Client_code#Python) for Python. This is pretty standard for most APIs. Some libraries are better than others, which again, is standard. For MediaWiki, the most up-to-date and human-readable one in Python is called [wptools](https://github.com/siznax/wptools).

#### Quiz
Get the page object for the E.T. the Extra-Terrestrial Wikipedia page. Here is the [E.T. Wikipedia page](https://en.wikipedia.org/wiki/E.T._the_Extra-Terrestrial) for easy reference.

In [31]:
import wptools

In [32]:
# get the E.T. page object
page = wptools.page('E.T._the_Extra-Terrestrial').get()

en.wikipedia.org (query) E.T._the_Extra-Terrestrial
en.wikipedia.org (query) E.T. the Extra-Terrestrial (&plcontinue=...
en.wikipedia.org (parse) 73441
www.wikidata.org (wikidata) Q11621
www.wikidata.org (labels) P5021|Q139184|Q258064|P646|Q4834543|Q22...
www.wikidata.org (labels) P3995|Q3953565|P2334|P18|Q1044183|P5008...
www.wikidata.org (labels) Q168383|Q4376972|Q60629803|P1712|P950|P...
www.wikidata.org (labels) P2518|Q900414|Q8395520|P3417|P5786|Q105...
www.wikidata.org (labels) P921|Q1315008|P2130|P437|Q499789|Q68608...
en.wikipedia.org (restbase) /page/summary/E.T._the_Extra-Terrestrial
en.wikipedia.org (imageinfo) File:E t the extra terrestrial ver3....
E.T. the Extra-Terrestrial (en) data
{
  aliases: <list(2)> E.T., ET
  assessments: <dict(4)> United States, Film, Science Fiction, Lib...
  claims: <dict(129)> P1562, P57, P272, P345, P31, P161, P373, P48...
  description: 1982 American film
  exhtml: <str(485)> <p><i><b>E.T. the Extra-Terrestrial</b></i> i...
  exrest: <str(46

In [33]:
# accessing the image attribute will return the images for this page
page.data['image'][0]

{'kind': 'parse-image',
 'file': 'File:E t the extra terrestrial ver3.jpg',
 'orig': 'E t the extra terrestrial ver3.jpg',
 'timestamp': '2016-06-04T10:30:46Z',
 'size': 83073,
 'width': 253,
 'height': 394,
 'url': 'https://upload.wikimedia.org/wikipedia/en/6/66/E_t_the_extra_terrestrial_ver3.jpg',
 'descriptionurl': 'https://en.wikipedia.org/wiki/File:E_t_the_extra_terrestrial_ver3.jpg',
 'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=7419503',
 'title': 'File:E t the extra terrestrial ver3.jpg',
 'metadata': {'DateTime': {'value': '2016-06-04 10:30:46',
   'source': 'mediawiki-metadata',
   'hidden': ''},
  'ObjectName': {'value': 'E t the extra terrestrial ver3',
   'source': 'mediawiki-metadata',
   'hidden': ''},
  'CommonsMetadataExtension': {'value': 1.2,
   'source': 'extension',
   'hidden': ''},
  'Categories': {'value': 'All non-free media|E.T. the Extra-Terrestrial|Fair use images of film posters|Files with no machine-readable author|Noindexed pages|Wik

In [34]:
page.data['infobox']

{'name': 'E.T. the Extra-Terrestrial',
 'image': 'E t the extra terrestrial ver3.jpg',
 'alt': 'The poster shows the planet earth, a child\'s finger touching E.T\'s finger, with a light blinking on contact. The top headline reads "His Adventure On Earth".',
 'caption': 'Theatrical release poster by [[John Alvin]]',
 'director': '[[Steven Spielberg]]',
 'producers': '{{unbulleted list|[[Kathleen Kennedy (producer)|Kathleen Kennedy]]|Steven Spielberg}}',
 'writer': '[[Melissa Mathison]]',
 'starring': '{{Plainlist|<!--Per poster billing-->|\n* [[Dee Wallace]]\n* [[Henry Thomas]]\n* [[Peter Coyote]]\n* [[Robert MacNaughton]]\n* [[Drew Barrymore]]}} * [[Dee Wallace]]\n* [[Henry Thomas]]\n* [[Peter Coyote]]\n* [[Robert MacNaughton]]\n* [[Drew Barrymore]]',
 'music': '[[John Williams]]',
 'cinematography': '[[Allen Daviau]]',
 'editing': '[[Carol Littleton]]',
 'studio': '[[Amblin Entertainment]]',
 'distributor': '[[Universal Pictures]]',
 'released': '{{Film date|1982|5|26|[[1982 Cannes Fi

In [35]:
page.data['infobox']['director']

'[[Steven Spielberg]]'
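
The infobox values come back as raw wikitext, so links are still wrapped in `[[...]]`. As a rough sketch, the hypothetical helper `strip_wikilinks` below removes that link markup with a regular expression. It does not attempt to handle templates such as `{{Plainlist|...}}`.

```python
import re

# Matches [[Target]] and [[Target|Label]], capturing the text a reader would see
WIKILINK = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]')

def strip_wikilinks(text):
    """Replace each wiki link with its display text."""
    return WIKILINK.sub(r'\1', text)

print(strip_wikilinks('[[Steven Spielberg]]'))
print(strip_wikilinks('[[Kathleen Kennedy (producer)|Kathleen Kennedy]]'))
```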

#### Downloading Files Programmatically using APIs and JSON

In [36]:
from PIL import Image
from io import BytesIO

In [37]:
title_list = [
 'The_Wizard_of_Oz_(1939_film)',
 'Citizen_Kane',
 'The_Third_Man',
 'Get_Out_(film)',
 'Mad_Max:_Fury_Road',
 'The_Cabinet_of_Dr._Caligari',
 'All_About_Eve',
 'Inside_Out_(2015_film)',
 'The_Godfather',
 'Metropolis_(1927_film)',
 'E.T._the_Extra-Terrestrial',
 'Modern_Times_(film)',
 'It_Happened_One_Night',
 "Singin'_in_the_Rain",
 'Boyhood_(film)',
 'Casablanca_(film)',
 'Moonlight_(2016_film)',
 'Psycho_(1960_film)',
 'Laura_(1944_film)',
 'Nosferatu',
 'Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 "A_Hard_Day%27s_Night_(film)",
 'La_Grande_Illusion',
 'North_by_Northwest',
 'The_Battle_of_Algiers',
 'Dunkirk_(2017_film)',
 'The_Maltese_Falcon_(1941_film)',
 'Repulsion_(film)',
 '12_Years_a_Slave_(film)',
 'Gravity_(2013_film)',
 'Sunset_Boulevard_(film)',
 'King_Kong_(1933_film)',
 'Spotlight_(film)',
 'The_Adventures_of_Robin_Hood',
 'Rashomon',
 'Rear_Window',
 'Selma_(film)',
 'Taxi_Driver',
 'Toy_Story_3',
 'Argo_(2012_film)',
 'Toy_Story_2',
 'The_Big_Sick',
 'Bride_of_Frankenstein',
 'Zootopia',
 'M_(1931_film)',
 'Wonder_Woman_(2017_film)',
 'The_Philadelphia_Story_(film)',
 'Alien_(film)',
 'Bicycle_Thieves',
 'Seven_Samurai',
 'The_Treasure_of_the_Sierra_Madre_(film)',
 'Up_(2009_film)',
 '12_Angry_Men_(1957_film)',
 'The_400_Blows',
 'Logan_(film)',
 'All_Quiet_on_the_Western_Front_(1930_film)',
 'Army_of_Shadows',
 'Arrival_(film)',
 'Baby_Driver',
 'A_Streetcar_Named_Desire_(1951_film)',
 'The_Night_of_the_Hunter_(film)',
 'Star_Wars:_The_Force_Awakens',
 'Manchester_by_the_Sea_(film)',
 'Dr._Strangelove',
 'Frankenstein_(1931_film)',
 'Vertigo_(film)',
 'The_Dark_Knight_(film)',
 'Touch_of_Evil',
 'The_Babadook',
 'The_Conformist_(film)',
 'Rebecca_(1940_film)',
 "Rosemary%27s_Baby_(film)",
 'Finding_Nemo',
 'Brooklyn_(film)',
 'The_Wrestler_(2008_film)',
 'The_39_Steps_(1935_film)',
 'L.A._Confidential_(film)',
 'Gone_with_the_Wind_(film)',
 'The_Good,_the_Bad_and_the_Ugly',
 'Skyfall',
 'Rome,_Open_City',
 'Tokyo_Story',
 'Hell_or_High_Water_(film)',
 'Pinocchio_(1940_film)',
 'The_Jungle_Book_(2016_film)',
 'La_La_Land_(film)',
 'Star_Trek_(film)',
 'High_Noon',
 'Apocalypse_Now',
 'On_the_Waterfront',
 'The_Wages_of_Fear',
 'The_Last_Picture_Show',
 'Harry_Potter_and_the_Deathly_Hallows_–_Part_2',
 'The_Grapes_of_Wrath_(film)',
 'Roman_Holiday',
 'Man_on_Wire',
 'Jaws_(film)',
 'Toy_Story',
 'The_Godfather_Part_II',
 'Battleship_Potemkin'
]

In [38]:
folder_name = 'support-files/02_Gathering-Data/bestofrt_posters'

# make directory if it doesn't already exist
os.makedirs(folder_name, exist_ok=True)

In [39]:
# List of dictionaries to build and convert to a DataFrame later
df_list = []
image_errors = {}
for ranking, title in enumerate(title_list, start=1):
    # reset so a failed lookup never records the previous film's images
    images = None
    try:
        # this cell is slow so print ranking to gauge time remaining
        print(ranking)
        page = wptools.page(title, silent=True)
        images = page.get().data['image']
        # first image is usually the poster
        first_image_url = images[0]['url']
        r = requests.get(first_image_url)
        # download movie poster image
        i = Image.open(BytesIO(r.content))
        image_file_format = first_image_url.split('.')[-1]
        i.save(folder_name + "/" + str(ranking) + "_" + title + '.' + image_file_format)
        # append to list of dictionaries
        df_list.append({'ranking': ranking,
                        'title': title,
                        'poster_url': first_image_url})
    
    # Not best practice to catch all exceptions but fine for this short script
    except Exception as e:
        print(str(ranking) + "_" + title + ": " + str(e))
        # images stays None when the page lookup itself failed
        image_errors[str(ranking) + "_" + title] = images

1
2
2_Citizen_Kane: cannot identify image file <_io.BytesIO object at 0x00000144548AAE00>
3
3_The_Third_Man: cannot identify image file <_io.BytesIO object at 0x000001445465BA40>
4
5
6
7
7_All_About_Eve: cannot identify image file <_io.BytesIO object at 0x00000144548A3F40>
8
9
10
10_Metropolis_(1927_film): cannot identify image file <_io.BytesIO object at 0x00000144548A3DB0>
11
12
13
13_It_Happened_One_Night: cannot identify image file <_io.BytesIO object at 0x00000144548A3360>
14
15
15_Boyhood_(film): 'image'
16
17
18
18_Psycho_(1960_film): cannot identify image file <_io.BytesIO object at 0x0000014453B45360>
19
19_Laura_(1944_film): cannot identify image file <_io.BytesIO object at 0x0000014454D45EA0>
20
21
22


API error: {'code': 'invalidtitle', 'info': 'Bad title "A_Hard_Day%27s_Night_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}


22_A_Hard_Day%27s_Night_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=A_Hard_Day%2527s_Night_%28film%29
23
24
24_North_by_Northwest: cannot identify image file <_io.BytesIO object at 0x0000014453C495E0>
25
26
27
27_The_Maltese_Falcon_(1941_film): cannot identify image file <_io.BytesIO object at 0x0000014453B45360>
28
29
30
31
31_Sunset_Boulevard_(film): cannot identify image file <_io.BytesIO object at 0x0000014453C32B80>
32
33
34
34_The_Adventures_of_Robin_Hood: cannot identify image file <_io.BytesIO object at 0x00000144538EB360>
35
35_Rashomon: cannot identify image file <_io.BytesIO object at 0x0000014453B5CB80>
36
37
38
39
40
41
42
43
43_Bride_of_Frankenstein: cannot identify image file <_io.BytesIO object at 0x0000014453C3F180>
44
45
46
47
47_The_Philadelphia_Story_(film): cannot identify image file <_io.Bytes

API error: {'code': 'invalidtitle', 'info': 'Bad title "Rosemary%27s_Baby_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}


72_Rosemary%27s_Baby_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=Rosemary%2527s_Baby_%28film%29
73
74
75
76
77
78
78_Gone_with_the_Wind_(film): cannot identify image file <_io.BytesIO object at 0x000001445476E4A0>
79
80
81
82
82_Tokyo_Story: cannot identify image file <_io.BytesIO object at 0x0000014453ECB090>
83
84
85
86
87
88
88_High_Noon: cannot identify image file <_io.BytesIO object at 0x0000014453EB1AE0>
89
90
91
91_The_Wages_of_Fear: cannot identify image file <_io.BytesIO object at 0x000001445421FF40>
92
93
94
94_The_Grapes_of_Wrath_(film): cannot identify image file <_io.BytesIO object at 0x0000014454C62360>
95
95_Roman_Holiday: cannot identify image file <_io.BytesIO object at 0x0000014453D08220>
96
96_Man_on_Wire: cannot identify image file <_io.BytesIO object at 0x00000144538DB310>
97
98
99
100
100_Batt

In [40]:
for key in image_errors.keys():
    print(key)

2_Citizen_Kane
3_The_Third_Man
7_All_About_Eve
10_Metropolis_(1927_film)
13_It_Happened_One_Night
15_Boyhood_(film)
18_Psycho_(1960_film)
19_Laura_(1944_film)
22_A_Hard_Day%27s_Night_(film)
24_North_by_Northwest
27_The_Maltese_Falcon_(1941_film)
31_Sunset_Boulevard_(film)
34_The_Adventures_of_Robin_Hood
35_Rashomon
43_Bride_of_Frankenstein
47_The_Philadelphia_Story_(film)
50_Seven_Samurai
51_The_Treasure_of_the_Sierra_Madre_(film)
53_12_Angry_Men_(1957_film)
56_All_Quiet_on_the_Western_Front_(1930_film)
57_Army_of_Shadows
60_A_Streetcar_Named_Desire_(1951_film)
61_The_Night_of_the_Hunter_(film)
66_Vertigo_(film)
68_Touch_of_Evil
71_Rebecca_(1940_film)
72_Rosemary%27s_Baby_(film)
78_Gone_with_the_Wind_(film)
82_Tokyo_Story
88_High_Noon
91_The_Wages_of_Fear
94_The_Grapes_of_Wrath_(film)
95_Roman_Holiday
96_Man_on_Wire
100_Battleship_Potemkin
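
The two `Bad title` errors above both involve titles that contain `%27`, the percent-encoded form of an apostrophe. The request URL in the error shows `%2527`, which is `%27` with its `%` encoded a second time. A minimal sketch of one possible fix, assuming the API expects the decoded title, is to pass each title through `urllib.parse.unquote` first. The helper name `decode_title` is hypothetical and not part of the original notebook.

```python
from urllib.parse import unquote

def decode_title(title):
    """Turn percent-encoded characters (e.g. %27) back into literal ones."""
    return unquote(title)

# Two titles taken from title_list, one percent-encoded and one not
sample_titles = ["A_Hard_Day%27s_Night_(film)", "Rear_Window"]
decoded = [decode_title(t) for t in sample_titles]
print(decoded)
```

Titles without any encoded characters pass through unchanged, so the helper could be applied to the whole list.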


Unfortunately, web scraping is a moving target and this code no longer works completely: 35 of the 100 posters failed to download. Still, it was good practice in using wptools!
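
`cannot identify image file` means Pillow could not recognize the downloaded bytes as an image, which can happen when a server returns something other than the image (an error page, for example). The notebook does not show what was actually returned, so the following is only a diagnostic sketch: it checks whether the bytes begin with the JPEG or PNG file signature. The function name `looks_like_image` is hypothetical.

```python
# File signatures (magic numbers) for two common poster formats
JPEG_SIGNATURE = b"\xff\xd8\xff"
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_image(content):
    """Return True if the bytes start with a JPEG or PNG signature."""
    return content.startswith(JPEG_SIGNATURE) or content.startswith(PNG_SIGNATURE)

# An HTML page is not an image; bytes starting with a PNG header are
print(looks_like_image(b"<!DOCTYPE html><html>...</html>"))
print(looks_like_image(PNG_SIGNATURE + b"\x00" * 8))
```

In the download loop, one could run this check on `r.content`, and inspect `r.status_code`, before calling `Image.open`.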

In [41]:
# Inspect unidentifiable images and download them individually
for rank_title, images in image_errors.items():
    if rank_title == '3_The_Third_Man':
        title = 'The_Third_Man'
        url = 'https://upload.wikimedia.org/wikipedia/commons/7/77/The_Third_Man_%281949_American_theatrical_poster%29.jpg'
    if rank_title == '5_Mad_Max':
        title = 'Mad_Max'
        url = 'https://upload.wikimedia.org/wikipedia/en/5/5a/MadMazAus.jpg'
    if rank_title == '6_The_Cabinet_of_Dr._Caligari':
        title = 'The_Cabinet_of_Dr._Caligari'
        url = 'https://upload.wikimedia.org/wikipedia/en/2/2f/The_Cabinet_of_Dr._Caligari_poster.jpg'
    if rank_title == '7_All_About_Eve':
        title = 'All_About_Eve'
        url = 'https://upload.wikimedia.org/wikipedia/commons/a/a7/All_About_Eve_%281950_poster_-_retouch%29.jpg'
    if rank_title == '10_Metropolis_(1927_film)':
        title = 'Metropolis_(1927_film)'
        url = 'https://upload.wikimedia.org/wikipedia/en/9/97/Metropolis_%28German_three-sheet_poster%29.jpg'
    if rank_title == '13_It_Happened_One_Night':
        title = "It_Happened_One_Night"
        url = 'https://upload.wikimedia.org/wikipedia/commons/d/dc/It-happened-one-night-poster.jpg'
    if rank_title == "14_Singin'_in_the_Rain":
        title = "Singin'_in_the_Rain"
        url = 'https://upload.wikimedia.org/wikipedia/commons/5/5d/Singin%27_in_the_Rain_%281952_poster%29.jpg'
    if rank_title == '15_Boyhood_(film)':
        title = "Boyhood_(film)"
        url = 'https://upload.wikimedia.org/wikipedia/en/a/a6/Boyhood_%282014%29.png'
    if rank_title == '19_Laura_(1944_film)':
        title = "Laura_(1944_film)"
        url = 'https://upload.wikimedia.org/wikipedia/commons/3/30/Laura_%281944_film_poster%29.jpg'
    if rank_title == '22_A_Hard_Day%27s_Night_(film)':
        title = "A_Hard_Day%27s_Night_(film)"
        url = 'https://upload.wikimedia.org/wikipedia/en/4/47/A_Hard_Days_night_movieposter.jpg'
    if rank_title == '24_North_by_Northwest':
        title = 'North_by_Northwest'
        url = 'https://upload.wikimedia.org/wikipedia/commons/8/83/Northbynorthwest1.jpg'
    if rank_title == '27_The_Maltese_Falcon_(1941_film)':
        title = 'The_Maltese_Falcon_(1941_film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/6/6b/The_Maltese_Falcon_%281941_film_poster%29.jpg'
    if rank_title == '31_Sunset_Boulevard_(film)':
        title = 'Sunset_Boulevard_(film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/1/14/Sunset_Boulevard_%281950_poster%29.jpg'
    if rank_title == '34_The_Adventures_of_Robin_Hood':
        title = 'The_Adventures_of_Robin_Hood'
        url = 'https://upload.wikimedia.org/wikipedia/commons/f/f7/The_Adventures_of_Robin_Hood_%281938_poster%29.jpg'
    if rank_title == '35_Rashomon':
        title = 'Rashomon'
        url = 'https://upload.wikimedia.org/wikipedia/commons/a/a3/Rashomon_poster_3.jpg'
    if rank_title == '36_Rear_Window':
        title = 'Rear_Window'
        url = 'https://upload.wikimedia.org/wikipedia/commons/3/38/Rear_Window_film_poster.jpg'
    if rank_title == '43_Bride_of_Frankenstein':
        title = 'Bride_of_Frankenstein'
        url = 'https://upload.wikimedia.org/wikipedia/commons/5/58/The_Bride_of_Frankenstein_%281935_poster%29.jpg'
    if rank_title == '47_The_Philadelphia_Story_(film)':
        title = 'The_Philadelphia_Story_(film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/5/54/The-Philadelphia-Story-%281940%29.jpg'
    if rank_title == '50_Seven_Samurai':
        title = 'Seven_Samurai'
        url = 'https://upload.wikimedia.org/wikipedia/commons/b/ba/Seven_Samurai_poster.jpg'
    if rank_title == '51_The_Treasure_of_the_Sierra_Madre_(film)':
        title = 'The_Treasure_of_the_Sierra_Madre_(film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/1/1d/The_Treasure_of_the_Sierra_Madre_%281947_poster%29.jpg'
    if rank_title == '53_12_Angry_Men_(1957_film)':
        title = '12_Angry_Men_(1957_film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/b/b5/12_Angry_Men_%281957_film_poster%29.jpg'
    if rank_title == '56_All_Quiet_on_the_Western_Front_(1930_film)':
        title = 'All_Quiet_on_the_Western_Front_(1930_film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/6/6c/All_Quiet_on_the_Western_Front_%281930_film%29_poster.jpg'
    if rank_title == '60_A_Streetcar_Named_Desire_(1951_film)':
        title = 'A_Streetcar_Named_Desire_(1951_film)'
        url = 'https://upload.wikimedia.org/wikipedia/en/6/66/StreetcarNamedDesire.JPG'
    if rank_title == '61_The_Night_of_the_Hunter_(film)':
        title = 'The_Night_of_the_Hunter_(film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/a/a5/The_Night_of_the_Hunter_%281955_poster%29.jpg'
    if rank_title == '62_Star_Wars':
        title = 'Star_Wars'
        url = 'https://upload.wikimedia.org/wikipedia/en/8/87/StarWarsMoviePoster1977.jpg'
    if rank_title == '64_Dr._Strangelove':
        title = 'Dr._Strangelove'
        url = 'https://en.wikipedia.org/wiki/Dr._Strangelove#/media/File:Dr._Strangelove_poster.jpg'
    if rank_title == '66_Vertigo_(film)':
        title = 'Vertigo_(film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/7/75/Vertigomovie_restoration.jpg'
    if rank_title == '68_Touch_of_Evil':
        title = 'Touch_of_Evil'
        url = 'https://upload.wikimedia.org/wikipedia/commons/0/09/Touch_of_Evil_%281958_poster%29.jpg'
    if rank_title == '72_Rosemary%27s_Baby_(film)':
        title = 'Rosemary%27s_Baby_(film)'
        url = 'https://upload.wikimedia.org/wikipedia/en/e/ef/Rosemarys_baby_poster.jpg'
    if rank_title == '78_Gone_with_the_Wind_(film)':
        title = 'Gone_with_the_Wind_(film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/2/27/Poster_-_Gone_With_the_Wind_01.jpg'
    if rank_title == '82_Tokyo_Story':
        title = 'Tokyo_Story'
        url = 'https://upload.wikimedia.org/wikipedia/commons/1/1b/Tokyo-story-20201121.jpg'
    if rank_title == '90_On_the_Waterfront':
        title = 'On_the_Waterfront'
        url = 'https://upload.wikimedia.org/wikipedia/commons/e/ee/On_the_Waterfront_%281954_poster%29.jpg'
    if rank_title == '94_The_Grapes_of_Wrath_(film)':
        title = 'The_Grapes_of_Wrath_(film)'
        url = 'https://upload.wikimedia.org/wikipedia/commons/c/c1/The_Grapes_of_Wrath_%281940_poster%29.jpg'
    if rank_title == '95_Roman_Holiday':
        title = 'Roman_Holiday'
        url = 'https://upload.wikimedia.org/wikipedia/commons/d/d0/Roman_Holiday_%281953_poster%29.jpg'
    if rank_title == '100_Battleship_Potemkin':
        title = 'Battleship_Potemkin'
        url = 'https://upload.wikimedia.org/wikipedia/commons/8/85/Vintage_Potemkin.jpg'
    
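    # the title is everything after the "<ranking>_" prefix
    title = rank_title.split('_', 1)[1]
    # df_list holds dictionaries, so compare against the titles already stored
    if title not in [d['title'] for d in df_list]:
        df_list.append({'ranking': int(title_list.index(title) + 1),
                        'title': title,
                        'poster_url': url})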

#     r = requests.get(url)
#     # Download movie poster image
#     i = Image.open(BytesIO(r.content))
#     image_file_format = url.split('.')[-1]
#     i.save(folder_name + "/" + rank_title + '.' + image_file_format)
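
The chain of `if` statements above can also be written as a dictionary lookup, which makes it explicit when a film has no manually chosen URL. This is a sketch rather than the notebook's code: `manual_urls`, `sample_errors` and `rows` are hypothetical names, and only two of the URLs above are repeated to keep it short.

```python
# Manually chosen poster URLs keyed by "<ranking>_<title>" (two entries copied from the cell above)
manual_urls = {
    '3_The_Third_Man': 'https://upload.wikimedia.org/wikipedia/commons/7/77/The_Third_Man_%281949_American_theatrical_poster%29.jpg',
    '35_Rashomon': 'https://upload.wikimedia.org/wikipedia/commons/a/a3/Rashomon_poster_3.jpg',
}

# Stand-in for image_errors.keys(); '88_High_Noon' has no manual URL in the cell above
sample_errors = ['3_The_Third_Man', '35_Rashomon', '88_High_Noon']

rows = []
for rank_title in sample_errors:
    # split only on the first underscore: "<ranking>_<title>"
    ranking, title = rank_title.split('_', 1)
    rows.append({'ranking': int(ranking),
                 'title': title,
                 # .get returns None instead of reusing a URL from an earlier iteration
                 'poster_url': manual_urls.get(rank_title)})

for row in rows:
    print(row['ranking'], row['title'], row['poster_url'] is not None)
```

With a lookup like this, a film without an entry ends up with a missing `poster_url` rather than silently inheriting the URL of whichever film was handled before it.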

In [42]:
# Create DataFrame from list of dictionaries
df = pd.DataFrame(df_list, columns=['ranking', 'title', 'poster_url'])
df = df.sort_values('ranking').reset_index(drop=True)

In [43]:
df.head()

Unnamed: 0,ranking,title,poster_url
0,1,The_Wizard_of_Oz_(1939_film),https://upload.wikimedia.org/wikipedia/commons...
1,2,Citizen_Kane,https://d17h27t6h515a5.cloudfront.net/topher/2...
2,3,The_Third_Man,https://upload.wikimedia.org/wikipedia/commons...
3,4,Get_Out_(film),https://upload.wikimedia.org/wikipedia/en/a/a3...
4,5,Mad_Max:_Fury_Road,https://upload.wikimedia.org/wikipedia/en/6/6e...


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   ranking     100 non-null    int64 
 1   title       100 non-null    object
 2   poster_url  100 non-null    object
dtypes: int64(1), object(2)
memory usage: 2.5+ KB


### Storing Data
Creating the word cloud visualizations also requires the assessing and cleaning steps, which are covered in the next two lessons. For now, Udacity provides the word cloud visualizations and the combined dataset for download, until I have the skills to generate them myself!

### Relational Databases in Python
#### Data Wrangling and Relational Databases
In the context of data wrangling, we recommend that databases and SQL only come into play for gathering data or storing data. That is:

- Storing data **from** a pandas DataFrame **in** a database to which you're connected, and
- Importing data **from** a database to which you're connected **to** a pandas DataFrame

Imagine the next cell contains all of the gathering code from this entire lesson, plus the assessing and cleaning code done behind the scenes, and that the final product is a merged master DataFrame called *df*.

In [45]:
df = pd.read_csv('support-files/02_Gathering-Data/bestofrt_master.csv')
df.head()

Unnamed: 0,ranking,title,critic_score,number_of_critic_ratings,audience_score,number_of_audience_ratings,review_url,review_text,poster_url
0,1,The Wizard of Oz (1939),99,110,89,874425,http://www.rogerebert.com/reviews/great-movie-...,As a child I simply did not notice whether a m...,https://upload.wikimedia.org/wikipedia/commons...
1,2,Citizen Kane (1941),100,75,90,157274,http://www.rogerebert.com/reviews/great-movie-...,“I don't think any word can explain a man's li...,https://upload.wikimedia.org/wikipedia/en/c/ce...
2,3,The Third Man (1949),100,77,93,53081,http://www.rogerebert.com/reviews/great-movie-...,Has there ever been a film where the music mor...,https://upload.wikimedia.org/wikipedia/en/2/21...
3,4,Get Out (2017),99,282,87,63837,http://www.rogerebert.com/reviews/get-out-2017,"With the ambitious and challenging “Get Out,” ...",https://upload.wikimedia.org/wikipedia/en/e/eb...
4,5,Mad Max: Fury Road (2015),97,370,86,123937,http://www.rogerebert.com/reviews/mad-max-fury...,George Miller’s “Mad Max” films didn’t just ma...,https://upload.wikimedia.org/wikipedia/en/6/6e...


### 1. Connect to a database

In [46]:
from sqlalchemy import create_engine

In [47]:
# Create SQLAlchemy Engine and empty bestofrt database
# bestofrt.db will not show up in the Jupyter Notebook dashboard yet
engine = create_engine('sqlite:///support-files/02_Gathering-Data/bestofrt.db')

### 2. Store pandas DataFrame in database
Store the data in the cleaned master dataset (bestofrt_master) in that database.

In [48]:
# Store cleaned master DataFrame ('df') in a table called master in bestofrt.db
# bestofrt.db will be visible now in the Jupyter Notebook dashboard
df.to_sql('master', engine, index=False)

89
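
By default, `to_sql` raises a `ValueError` if the table already exists, so re-running the cell above against an existing `bestofrt.db` would fail. Below is a sketch of storage that can be re-run, using an in-memory SQLite connection and a made-up two-row DataFrame (`sample_df`) instead of the master dataset.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the master DataFrame
sample_df = pd.DataFrame({'ranking': [1, 2],
                          'title': ['The Wizard of Oz (1939)', 'Citizen Kane (1941)']})

# In-memory database so nothing is written to disk
conn = sqlite3.connect(':memory:')

# if_exists='replace' drops and recreates the table, so running this twice is safe
sample_df.to_sql('master', conn, index=False, if_exists='replace')
sample_df.to_sql('master', conn, index=False, if_exists='replace')

row_count = conn.execute('SELECT COUNT(*) FROM master').fetchone()[0]
print(row_count)
```

Whether replacing the table is appropriate depends on the use case; `if_exists='append'` adds rows instead.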

### 3. Read database data into a pandas DataFrame
Read the brand new data in that database back into a pandas DataFrame.

In [49]:
df_gather = pd.read_sql('SELECT * FROM master', engine)

In [50]:
df_gather.head()

Unnamed: 0,ranking,title,critic_score,number_of_critic_ratings,audience_score,number_of_audience_ratings,review_url,review_text,poster_url
0,1,The Wizard of Oz (1939),99,110,89,874425,http://www.rogerebert.com/reviews/great-movie-...,As a child I simply did not notice whether a m...,https://upload.wikimedia.org/wikipedia/commons...
1,2,Citizen Kane (1941),100,75,90,157274,http://www.rogerebert.com/reviews/great-movie-...,“I don't think any word can explain a man's li...,https://upload.wikimedia.org/wikipedia/en/c/ce...
2,3,The Third Man (1949),100,77,93,53081,http://www.rogerebert.com/reviews/great-movie-...,Has there ever been a film where the music mor...,https://upload.wikimedia.org/wikipedia/en/2/21...
3,4,Get Out (2017),99,282,87,63837,http://www.rogerebert.com/reviews/get-out-2017,"With the ambitious and challenging “Get Out,” ...",https://upload.wikimedia.org/wikipedia/en/e/eb...
4,5,Mad Max: Fury Road (2015),97,370,86,123937,http://www.rogerebert.com/reviews/mad-max-fury...,George Miller’s “Mad Max” films didn’t just ma...,https://upload.wikimedia.org/wikipedia/en/6/6e...


### Data Wrangling in SQL?
Data wrangling can actually be performed in SQL. We believe that pandas is better equipped for gathering (pandas has a huge simplicity advantage in this area), assessing, and cleaning data, so we usually recommend that you use pandas if given the choice. In a work setting, though, your choice of tool for data wrangling may depend on your company's infrastructure.
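
To make the comparison concrete, here is the same small wrangling step done both ways: keeping the films with a critic score of 100 and sorting them by ranking. The rows are the first five of the master dataset shown earlier; the variable names and the in-memory database are assumptions for illustration.

```python
import sqlite3
import pandas as pd

sample_df = pd.DataFrame({
    'ranking': [1, 2, 3, 4, 5],
    'title': ['The Wizard of Oz (1939)', 'Citizen Kane (1941)', 'The Third Man (1949)',
              'Get Out (2017)', 'Mad Max: Fury Road (2015)'],
    'critic_score': [99, 100, 100, 99, 97],
})

# pandas: boolean mask, then sort
via_pandas = (sample_df[sample_df['critic_score'] == 100]
              .sort_values('ranking')
              .reset_index(drop=True))

# SQL: the same filter and sort expressed as a query
conn = sqlite3.connect(':memory:')
sample_df.to_sql('master', conn, index=False)
via_sql = pd.read_sql('SELECT * FROM master WHERE critic_score = 100 ORDER BY ranking', conn)

print(via_pandas['title'].tolist())
print(via_sql['title'].tolist())
```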

Here is an interesting [Reddit thread that debates pandas vs. SQL](https://www.reddit.com/r/Python/comments/1tqjt4/why_do_you_use_pandas_instead_of_sql/) in general and touches on several topics related to data wrangling.

### Summary
Gathering is the first step in the data wrangling process:

1. **Gather**
2. Assess
3. Clean

Depending on the source of your data, and what format it's in, the steps in gathering data vary.

The high-level gathering process:

- Obtaining data (downloading a file from the internet, scraping a web page, querying an API, etc.)
- Importing that data into your programming environment (e.g. Jupyter Notebook)
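
The two steps can be sketched end to end with a flat file. To stay runnable offline, the "obtain" step here writes a tiny tab-separated file to a temporary directory instead of downloading one; the file name and contents are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Step 1: obtain the data (simulated here by writing a small TSV file)
tmp_dir = tempfile.mkdtemp()
tsv_path = os.path.join(tmp_dir, 'sample.tsv')
with open(tsv_path, 'w', encoding='utf-8') as f:
    f.write('ranking\ttitle\n1\tThe Wizard of Oz (1939)\n2\tCitizen Kane (1941)\n')

# Step 2: import it into the programming environment
sample_df = pd.read_csv(tsv_path, sep='\t')
print(sample_df.shape)
```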