## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The most difficult aspect of this project is the first: scraping the page on the BBC and, using beautiful soup and regular expressions, building a data set that will work.

Once you have the data set, you will be in good shape going forward--the goal after that will be to search for interesting patterns (top movies by country/critic/director/year)--this is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge you'll be dealing with this week. The HTML page on the BBC site (mirrored on my site) poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes things a little harder, because there are not many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for a critic, and you can isolate each group of top 10 movies as well. The harder part is using Beautiful Soup to find each critic together with the list of movies that immediately follows them. (This is challenging. I have instructions on how to figure it out, but if you can't, just DM me on Slack and I will help you!)

That is how this process will work: below I have step-by-step instructions so you can try to write the code yourself. Do your best, and if you can't get there, Slack me and I will help get your code working so you can move on to the next step.



### Getting started: Data Architecture

The central challenge of this project is figuring out how you are going to set up your table or tables from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: the main categories of analysis that are possible include movie, director, critic, critic's country, year, and whatever else you bring to this. Try to design a schema that will give you a table that you can run solid queries on.

You will eventually want to bring this into pandas, so you want to keep your table as simple and structured as possible. Try to think about how you can transform the main source into one large table that can be aggregated and grouped.
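
One possible layout, sketched here under the assumption that each row is a single vote (one critic ranking one film), might look like the following. The field names are illustrative rather than a required schema, and the sample rows are the first four entries from the first critic on the page.

```python
from collections import Counter

# One row per vote: a single critic ranking a single film.
# The field names are illustrative assumptions, not a required schema.
votes = [
    {"critic": "Simon Abrams", "org": "Freelance film critic", "critic_country": "US",
     "rank": 1, "movie": "Mulholland Drive", "director": "David Lynch", "year": 2001},
    {"critic": "Simon Abrams", "org": "Freelance film critic", "critic_country": "US",
     "rank": 2, "movie": "In the Mood for Love", "director": "Wong Kar-wai", "year": 2000},
    {"critic": "Simon Abrams", "org": "Freelance film critic", "critic_country": "US",
     "rank": 3, "movie": "The Tree of Life", "director": "Terrence Malick", "year": 2011},
    {"critic": "Simon Abrams", "org": "Freelance film critic", "critic_country": "US",
     "rank": 4, "movie": "Yi Yi: A One and a Two", "director": "Edward Yang", "year": 2000},
]

# With one vote per row, a "top N by year" question becomes a simple count.
votes_per_year = Counter(row["year"] for row in votes)
print(votes_per_year.most_common(1))
```

The critic fields repeat on every row, which looks wasteful but is what makes grouping by country, director, or year a one-line operation later.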

### Interpretive Architecture
**REMEMBER: secondary source** Part of the steps this week is to find a source you can use to get the country of origin for each director. You need to search for this on your own. It will be hard to find a single page that lists every single director, but see what you can find. In the end, you don't need a complete database of every director, but do your best to get as many as you can.

You don't necessarily have to go in the direction of directors' origin. You can certainly try to think of other categories of interpretation that you can join to this initial dataset. This is how you bring your point of view to a relatively large data set that seeks to frame the past 15 years of cinema. How can you bring a different point of view to this subject? You can certainly narrow your focus to a specific country, a group of countries, or a region. Either way, think about other data that might bring different types of insight to this list.

### Ready to code?

The first thing you need to do is import beautiful soup & requests like we did in the homework, and scrape the page.

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

Okay let's begin!

**STEP 1**


In [1]:
##Import your libraries: Beautiful soup, requests, and re (For regular expressions)
import re
import requests
from bs4 import BeautifulSoup

In [2]:
# read the URL, and put the HTML page into beautiful soup

my_url = "https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted"
raw_html = requests.get(my_url).content
soup = BeautifulSoup(raw_html, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <title>
   The 21st Century’s 100 greatest films: Who voted?
  </title>
  <meta content="The 21st Century’s 100 greatest films: Who voted?" property="og:title"/>
  <meta content="The 21st Century’s 100 greatest films: Who voted?" name="twitter:title"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." name="description"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." property="og:description"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." name="twitter:description"/>
  <meta content="https://ychef.files.bbci.co.uk/624x351/p04548r6.jpg" property="og:image"/>
  <meta content="https://ychef.files.bbci.co.uk/624x351/p04548r6.jpg" name="twitter:image:src"/>
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="#da532c" name="msapplicat

In [3]:
#Using beautiful soup find the div tag that contains
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information

all_info = soup.find("div", id="__next")
#print(all_info.text)

**STEP 2** Here is where it begins to get tricky: at this point everything we want is wrapped in `<p>` tags. Use a Beautiful Soup find_all to get a list of everything in a `<p>` tag. Make a variable that contains that list (you could call it all_p or something).


In [4]:
#find_all

all_p = all_info.find_all("p")
for p in all_p:
   print(p.text)

We polled 177 critics from around the world – here is how they voted.
Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.
Simon Abrams – Freelance film critic (US)
1. Mulholland Drive (David Lynch, 2001)
2. In the Mood for Love (Wong Kar-wai, 2000)
3. The Tree of Life (Terrence Malick, 2011)
4. Yi Yi: A One and a Two (Edward Yang, 2000)
5. Goodbye to Language (Jean-Luc Godard, 2014)
6. The White Meadows (Mohammad Rasoulof, 2

**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the `<p>` elements (the list you just got from find_all()) and pull out each critic and their list of movies.

Critics should not be too hard. But in order to get the movies attached to that critic you need to be smart about your beautiful soup method.

So, you need to build a loop that goes through your list of all `<p>` elements.


As you go through this loop, print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by the movie lines' HTML, you've got it!

I give you the beginning of the loop below, and then you can build it piece by piece.



In [5]:
##Write your loop for STEP 3 here
#I started this for you,

# Critics part
# print(all_p[2].find('b').text)

# #movie part
# print(all_p[3].text)

# print(all_p[12].text)

# print (len(all_p))
# print (all_p[1948])

#for lines in all_p[2:1948]:
        #manage your HTML tags to isolate
        #critic_info = ???
        #movie_info = ???

start_index = 2     # first critic paragraph (inclusive)
end_index = 1948    # stop before this index (exclusive)
jump_index = 11     # one critic line plus ten movie lines

critic_info = []
movie_info = []

for idx in range(start_index, end_index, jump_index):
  # the critic line sits inside a <b> tag
  critic_info.append(all_p[idx].find('b').text)
  # the next ten paragraphs are that critic's movies
  for offset in range(1, 11):
    movie_info.append(all_p[idx + offset].text)

print(critic_info)
print(movie_info)

['Simon Abrams – Freelance film critic (US)', 'Sam Adams – Freelance film critic (US)', 'Thelma Adams – Freelance film critic (US)', 'Arturo Aguilar – Rolling Stone Mexico (Mexico)', 'Matthew Anderson – BBC Culture (UK)', 'Tim Appelo – The Wrap (US)', 'Adriano Aprà – Film historian (Italy)', 'Michael Arbeiter – Nerdist (US)', 'Ali Arikan – Dipnot TV (Turkey)', 'Michael Atkinson – The Village Voice (US)', 'Ana Maria Bahiana – Freelance film critic (Brazil)', 'Cameron Bailey – Toronto Film Festival (Canada)', 'Lindsay Baker – BBC Culture (UK)', 'Miriam Bale – Freelance film critic (US)', 'Nicholas Barber – BBC Culture (UK)', 'Diego Batlle – La Nacion (Argentina)', 'NT Binh – Positif (France)', 'Lizelle Bisschoff – University of Glasgow (UK)', 'Christian Blauvelt – BBC Culture (US)', 'Mahen Bonetti – African Film Festival Inc (US)', 'Andreas Borcholte – Spiegel Online (Germany)', 'Utpal Borpujari – Freelance film critic (India)', 'Richard Brody – The New Yorker (US)', 'Hannah Brown – Jeru
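
The fixed stride of 11 works only when every critic is followed by exactly ten movie paragraphs. A more defensive variant, sketched here on plain strings, starts a new group whenever a line does not begin with a rank such as `1.`. The helper name `group_votes` and the short sample list are assumptions for illustration; in the notebook you could pass in something like `[p.text for p in all_p[2:]]`. Note that any unranked paragraph (an intro or footer, for instance) would be treated as a critic line, so the input still needs trimming.

```python
import re

# A movie paragraph starts with a rank, e.g. "1. Mulholland Drive (...)"
RANK_LINE = re.compile(r"^\s*\d+\.\s")

def group_votes(lines):
    """Group paragraph strings into (critic_line, [movie_lines]) pairs."""
    groups = []
    for line in lines:
        if RANK_LINE.match(line):
            if groups:  # skip ranked lines that appear before any critic
                groups[-1][1].append(line.strip())
        elif line.strip():
            # any non-empty, unranked line opens a new critic group
            groups.append((line.strip(), []))
    return groups

sample_lines = [
    "Simon Abrams – Freelance film critic (US)",
    "1. Mulholland Drive (David Lynch, 2001)",
    "2. In the Mood for Love (Wong Kar-wai, 2000)",
    "Sam Adams – Freelance film critic (US)",
]

for critic_line, movie_lines in group_votes(sample_lines):
    print(critic_line, len(movie_lines))
```

Checking that every group ends up with ten movies is then a quick way to spot a critic whose list is formatted differently.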

**STEP 4**
If your loop is successfully isolating those two lines, it's time to parse each line with regular expressions. This needs to happen inside the loop: for every critic, and then (in STEP 5) for every movie. Here, just **focus on getting the critic's name, organization, and country.**

Inside the loop, once you have critic_info, write a regular expression that pulls out the name of the critic and store the result in a variable called critic_name:

`critic_name = re.findall(regex, critic_info)`

Do the same thing for critic_org and critic_cn.

As you go, print(critic_name), then print(critic_org), etc., to make sure you're getting the results. It might help, before you run all these regular expressions in a loop, to grab one critic's line and test the regular expressions on it to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [6]:
#Practice/Build your regular expressions here
crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
# regex_for_name = r"^(\w+ \w+)"
# regex_for_org = r""
# regex_for_cn = r"\(\w*\)$"
# name = re.findall(regex_for_name,crit_sample)
# name[0]

# Regex
pattern = r'^(.*?) – (.*?) \((.*?)\)$'

matches = re.findall(pattern, crit_sample, re.MULTILINE)
for match in matches:
    critic_name, organization, country = match
    print(f'Critic Name: {critic_name.strip()}')
    print(f'Organization: {organization.strip()}')
    print(f'Country: {country.strip()}')
    print('----')

Critic Name: Arturo Aguilar
Organization: Rolling Stone Mexico
Country: Mexico
----
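
As an alternative to unpacking `findall` tuples, the same line can be parsed with named groups, which also makes the "no match" case explicit. This is only a sketch: the function name `parse_critic` and the key names are assumptions, and the pattern assumes the separator is the en dash (`–`) seen in the page text.

```python
import re

# name – organization (country), with the country in the final parentheses
CRITIC_RE = re.compile(
    r"^(?P<name>.+?)\s+–\s+(?P<org>.+)\s+\((?P<country>[^()]+)\)\s*$"
)

def parse_critic(line):
    """Return a dict with name/org/country, or None if the line does not fit."""
    m = CRITIC_RE.match(line.strip())
    return m.groupdict() if m else None

print(parse_critic("Arturo Aguilar – Rolling Stone Mexico (Mexico)"))
print(parse_critic("not a critic line"))
```

Returning `None` instead of silently skipping the line makes it easier to count how many critic lines failed to parse.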


In [9]:
#Take your working loop from step three
#and put it here with the regular expression parsing inside it

critic_names = []
critic_orgs = []
critic_countries = []
movie_info = []  # reset so the movies from the STEP 3 cell are not added twice

critic_pattern = r'^(.*?) – (.*?) \((.*?)\)$'

for idx in range(start_index, end_index, jump_index):
  this_critic = all_p[idx].find('b').text
  # findall returns a list of (name, org, country) tuples; empty if no match
  for match in re.findall(critic_pattern, this_critic, re.MULTILINE):
    critic_name, organization, country = match
    critic_names.append(critic_name.strip())
    critic_orgs.append(organization.strip())
    critic_countries.append(country.strip())
  for offset in range(1, 11):
    movie_info.append(all_p[idx + offset].text)

print(critic_names, critic_orgs, critic_countries)
print(movie_info)

['Simon Abrams', 'Sam Adams', 'Thelma Adams', 'Arturo Aguilar', 'Matthew Anderson', 'Tim Appelo', 'Adriano Aprà', 'Michael Arbeiter', 'Ali Arikan', 'Michael Atkinson', 'Ana Maria Bahiana', 'Cameron Bailey', 'Lindsay Baker', 'Miriam Bale', 'Nicholas Barber', 'Diego Batlle', 'NT Binh', 'Lizelle Bisschoff', 'Christian Blauvelt', 'Mahen Bonetti', 'Andreas Borcholte', 'Utpal Borpujari', 'Richard Brody', 'Hannah Brown', 'Luke Buckmaster', 'Luciano Castillo', 'Monica Castillo', 'Samuel Castro', 'Justin Chang', 'Enrico Chiesa', 'Cho Seongyong', 'Robbie Collin', 'Scott Collura', 'Colin Covert', 'Oggs Cruz', 'Ken Dancyger', 'Erik Davis', 'Peter Debruge', 'Fernand Denis', 'Lindiwe Dovey', 'Alonso Duralde', 'Bilge Ebiri', 'David Ehrlich', 'Kate Erbland', 'Mario Espinosa', 'Joseph Fahim', 'Devin Faraci', 'David Fear', 'Scott Feinberg', 'Javier Porta Fouz', 'Kenji Fujishima', 'Ernesto Garratt', 'Steven Gaydos', 'Noah Gittell', 'Owen Gleiberman', 'Ed Gonzalez', 'Juan Carlos González', 'Carmen Gray', 

**STEP 5**
Now you need to get your **movie info**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. See our old scraping homeworks for how to get a list of each movie entry--which will contain a string for each movie. Like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop, that goes to each movie and prints out each movie.


In [8]:
## Take your working loop and add the find_all for each_movie,
# plus the inner loop that goes through each_movie

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.


In [7]:
#Practice/Build your regular expressions here
# movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
# movie_sample = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
movie_sample = "5. Madagascar 3: Europe's Most Wanted (Eric Darnell, Tom McGrath and Conrad Vernon, 2012)"
# regex_for_mname = r""
# movie_name = re.findall(regex_for_mname,movie_sample)
# movie_name[0]

pattern = r'^\d+\.\s(.*?) \((.*?)\)$'
matches = re.findall(pattern, movie_sample, re.MULTILINE)
for match in matches:
    movie_name, director_year = match
    director_year_list = director_year.split(', ')

    director_name = ', '.join(director_year_list[:-1])
    year = director_year_list[-1]

    print(f'Movie Name: {movie_name}')
    print(f'Director Name(s): {director_name}')
    print(f'Year: {year}')

Movie Name: Madagascar 3: Europe's Most Wanted
Director Name(s): Eric Darnell, Tom McGrath and Conrad Vernon
Year: 2012
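
The lazy pattern above stops the title at the first ` (`, so a title that itself contains parentheses would be cut short. A slightly stricter sketch is shown below: it captures the rank as well, and anchors on the final `(director, year)` group. The function name `parse_movie` is an assumption, and the parenthesized title in the last example is a made-up line used only to exercise that case; whether such titles occur in the BBC list is not something this sketch verifies.

```python
import re

# rank. title (director(s), year), where the last parentheses hold no nested ones
MOVIE_RE = re.compile(
    r"^(?P<rank>\d+)\.\s+(?P<title>.+)\s"
    r"\((?P<director>[^()]+),\s*(?P<year>\d{4})\)\s*$"
)

def parse_movie(line):
    """Return rank/title/director/year as a dict, or None if the line does not fit."""
    m = MOVIE_RE.match(line.strip())
    if m is None:
        return None
    row = m.groupdict()
    row["rank"] = int(row["rank"])
    row["year"] = int(row["year"])
    return row

print(parse_movie("7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"))
print(parse_movie("5. Madagascar 3: Europe's Most Wanted "
                  "(Eric Darnell, Tom McGrath and Conrad Vernon, 2012)"))
# Hypothetical line, not taken from the page:
print(parse_movie("3. Some Title (Part 2) (Jane Doe, 2010)"))
```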


**STEP 6**
You're almost there!!! Now that you have a working regular expression, put it in your inner loop to get the movie name.

So now, for each critic, the loop should be getting you 13 elements. Three describe the critic:
- critic_name
- critic_org
- critic_cn

The other 10 come from an inner loop that runs 10 times (for the 10 movies) and gives you 10 instances of:
- rank (this is actually optional, but maybe helpful to keep)
- movie_name
- director
- year

Build this loop using print() on the first one or two critic selections. Just to make sure you are pulling out the right data.




In [8]:
#Get that loop working here

critic_names = []
critic_orgs = []
critic_countries = []
movie_names = []
movie_years = []
directors = []

critic_pattern = r'^(.*?) – (.*?) \((.*?)\)$'
movie_pattern = r'^\d+\.\s(.*?) \((.*?)\)$'

for idx in range(start_index, end_index, jump_index):
  this_critic = all_p[idx].find('b').text
  for match in re.findall(critic_pattern, this_critic, re.MULTILINE):
    critic_name, organization, country = match
    critic_names.append(critic_name.strip())
    critic_orgs.append(organization.strip())
    critic_countries.append(country.strip())
  for offset in range(1, 11):
    this_movie = all_p[idx + offset].text.strip()
    for match in re.findall(movie_pattern, this_movie, re.MULTILINE):
        movie_name, director_year = match
        # the year is the last comma-separated item;
        # everything before it is the director(s)
        director_year_list = director_year.split(', ')
        director_name = ', '.join(director_year_list[:-1])
        year = director_year_list[-1]
        movie_names.append(movie_name.strip())
        movie_years.append(year.strip())
        directors.append(director_name.strip())


#print(critic_names, critic_orgs, critic_countries)
print(directors, movie_years, movie_names)

['David Lynch', 'Wong Kar-wai', 'Terrence Malick', 'Edward Yang', 'Jean-Luc Godard', 'Mohammad Rasoulof', 'Raoul Ruiz', 'Abbas Kiarostami', 'Johnnie To', 'Carlos Saura', 'Wong Kar-wai', 'Michel Gondry', 'Apichatpong Weerasethakul', 'Hayao Miyazaki', 'Joshua Oppenheimer', 'Wes Anderson', 'Terrence Malick', 'Abbas Kiarostami', 'Jia Zhangke', 'Gus Van Sant', 'Kathryn Bigelow', 'David Cronenberg', 'Wes Anderson', 'Sarah Polley', 'Martin Campbell', 'Michel Gondry', 'Miguel Gomes', 'Pablo Berger', 'Courtney Hunt', 'Robert Altman', 'Wong Kar-wai', 'David Lynch', 'Christopher Nolan', 'Guillermo Del Toro', 'Michael Haneke', 'Werner Herzog', 'Cristian Mungiu', 'Leos Carax', 'Claude Lanzmann', 'Paul Thomas Anderson', 'Michael Haneke', 'Kenneth Lonergan', 'Mary Harron', 'Cristian Mungiu', 'Michael Haneke', 'David Lynch', 'Jessica Hausner', 'Andrea Arnold', 'Richard Linklater', 'Pablo Larraín', 'Joel and Ethan Coen', 'Hayao Miyazaki', 'Asghar Farhadi', 'Guillermo Del Toro', 'Andrew Stanton and Lee 

**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this let me know and we can discuss what to do next. If you've made it just following instructions, you are in great shape for the rest of this project--if not, don't worry! I will get you through by midweek.

The final step is building a list of lists of all this information.

So you need to have a loop that gets everything out, but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?




In [9]:
#figure out how you're going to collect your clean information
#list_of_what = []

#loop through the beautiful soup elements
#and use the regexes you developed above to get each unit of info


#You will want to build a list that gets appended to list_of_what
#Try to figure out how you want to append things
#That is, how you want to organize your data



print (len(critic_names), len (critic_orgs), len (critic_countries))
print (len(movie_names), len (movie_years), len (directors))
#print (directors)

177 177 177
1770 1770 1770
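
Given parallel lists with the lengths printed above (one entry per critic, ten per critic for the movies), one way to assemble a list of lists is sketched below. The function name `build_rows`, the column order, and the `per_critic` parameter are assumptions; the rank is derived from position, which only holds if the movies were appended in ranked order. The demo uses two movies per critic to stay short.

```python
def build_rows(critic_names, critic_orgs, critic_countries,
               movie_names, directors, movie_years, per_critic=10):
    """Flatten parallel critic and movie lists into one row per vote."""
    if len(movie_names) != len(critic_names) * per_critic:
        # a skipped regex match would leave the lists out of step
        raise ValueError("movie lists do not line up with critic lists")
    rows = []
    for i, movie in enumerate(movie_names):
        c = i // per_critic          # index of the critic who cast this vote
        rank = i % per_critic + 1    # position within that critic's list
        rows.append([critic_names[c], critic_orgs[c], critic_countries[c],
                     rank, movie, directors[i], movie_years[i]])
    return rows

rows = build_rows(
    ["Simon Abrams"], ["Freelance film critic"], ["US"],
    ["Mulholland Drive", "In the Mood for Love"],
    ["David Lynch", "Wong Kar-wai"],
    ["2001", "2000"],
    per_critic=2,
)
for row in rows:
    print(row)
```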


In [12]:
##Take a peek at your final lists of lists

# critic_names = [] (size 177)
# critic_orgs = [] (size 177)
# critic_countries = [] (size 177)
# movie_names = [](size 1770)
# movie_years = [](size 1770)
# directors = [](size 1770)

# import numpy as np
# import pandas as pd
# #col_names = ['director']
# df = pd.DataFrame(directors, columns=['Director'])
# df.head(20)


If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own--they will use similar logic, but they will not be nearly as complicated as this one.

In [10]:
#Scrape of Wikipedia
dir_dob = []
dir_pob = []
for d in directors:
  # Wikipedia page titles use underscores in place of spaces
  query_str = "_".join(d.split())

  #print (query_str)
  my_url = "https://en.wikipedia.org/wiki/"+query_str
  print (my_url)
  raw_html = requests.get(my_url).content
  soup = BeautifulSoup(raw_html, "html.parser")
  #print(soup.prettify())
  # Take the first infobox data cell, which this code assumes holds the "Born" entry
  target_cell = soup.find('td', class_='infobox-data')

  if target_cell:
      text = target_cell.get_text().strip()  
      #print(text)
      dob_pattern = r"\(\d{4}\-\d{2}\-\d{2}\)"
      # ['Shanghai, China', 'Ottawa, Illinois, U.S.', 'November 6, 1947Shanghai, Republic of China', '3 December 1930Paris, France', 'Shiraz, Imperial State of Iran', '25 July 1941Puerto Montt, Chile', '22 June 1940Tehran, Imperial State of Iran', 'British Hong Kong', '4 January 1932Huesca, Spain']
      pob_pattern = r"\)[\w ,.]+$"
      dob_match = re.search(dob_pattern, text)
      if (dob_match):
          dob = dob_match.group()
          #print (dob.lstrip('(').rstrip(')'))
          dir_dob.append(dob.lstrip('(').rstrip(')'))
      else:
          #print ("None")
          dir_dob.append("None")
      pob_match = re.search(pob_pattern, text)
      if (pob_match):
          pob = pob_match.group().strip()
          #print (pob.lstrip(')'))

          # Second level: if the text still contains a four-digit year,
          # keep only what follows it (the place of birth)
          pattern = r'\d{4}(.+)$'

          # Find the first match for the pattern in the sample text
          match = re.search(pattern, pob, re.MULTILINE)

          if match:
              matched_text = match.group(1)
              #print("Text that matches the pattern within parentheses:", matched_text)
              dir_pob.append(matched_text.strip())
          else:
              #print("No second match")
              dir_pob.append(pob.lstrip(')'))
      else:
          #print ("None")
          dir_pob.append("None")
  else:
      #print("Table with class 'infobox biography vcard' not found on the page.")
      dir_dob.append("None")
      dir_pob.append("None")

len(dir_dob)
len(dir_pob)

#print (dir_dob)
#print (dir_pob)


# critic_names = [] (size 177)
# critic_orgs = [] (size 177)
# critic_countries = [] (size 177)
# movie_names = [](size 1770)
# movie_years = [](size 1770)
# directors = [](size 1770)
# dir_dob = [] (size 1770)
# dir_pob = [] (size 1770)

https://en.wikipedia.org/wiki/David_Lynch
https://en.wikipedia.org/wiki/Wong_Kar-wai
https://en.wikipedia.org/wiki/Terrence_Malick
https://en.wikipedia.org/wiki/Edward_Yang
https://en.wikipedia.org/wiki/Jean-Luc_Godard
https://en.wikipedia.org/wiki/Mohammad_Rasoulof
https://en.wikipedia.org/wiki/Raoul_Ruiz
https://en.wikipedia.org/wiki/Abbas_Kiarostami
https://en.wikipedia.org/wiki/Johnnie_To
https://en.wikipedia.org/wiki/Carlos_Saura
https://en.wikipedia.org/wiki/Wong_Kar-wai
https://en.wikipedia.org/wiki/Michel_Gondry
https://en.wikipedia.org/wiki/Apichatpong_Weerasethakul
https://en.wikipedia.org/wiki/Hayao_Miyazaki
https://en.wikipedia.org/wiki/Joshua_Oppenheimer
https://en.wikipedia.org/wiki/Wes_Anderson
https://en.wikipedia.org/wiki/Terrence_Malick
https://en.wikipedia.org/wiki/Abbas_Kiarostami
https://en.wikipedia.org/wiki/Jia_Zhangke
https://en.wikipedia.org/wiki/Gus_Van_Sant
https://en.wikipedia.org/wiki/Kathryn_Bigelow
https://en.wikipedia.org/wiki/David_Cronenberg
https://en

1770
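
The loop above requests a page for every one of the 1770 entries, even though many directors repeat (Wong Kar-wai and Terrence Malick both appear twice in the first twenty URLs). A minimal sketch of caching by director name follows. The names `wiki_url`, `lookup_all`, and `fake_fetch` are assumptions; `fake_fetch` stands in for the real request-and-parse step so the example runs without network access.

```python
from urllib.parse import quote

def wiki_url(director):
    """Build an English Wikipedia URL from a name (spaces become underscores)."""
    return "https://en.wikipedia.org/wiki/" + quote("_".join(director.split()))

def lookup_all(directors, fetch):
    """Call fetch(url) once per distinct director and reuse it for repeats."""
    cache = {}
    results = []
    for d in directors:
        if d not in cache:
            cache[d] = fetch(wiki_url(d))
        results.append(cache[d])   # output stays aligned with the input list
    return results

# Stand-in for the real request: records each call and returns the page title.
calls = []
def fake_fetch(url):
    calls.append(url)
    return url.rsplit("/", 1)[-1]

sample = ["David Lynch", "Wong Kar-wai", "David Lynch"]
print(lookup_all(sample, fake_fetch))
print(len(calls))
```

Because the result list keeps one entry per input, it can still be zipped against `directors` the same way `dir_dob` and `dir_pob` are.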

In [14]:
import csv
# CSV file 
with open('movie_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)

    # Each critic
    for i in range(len(critic_names)):
        critic = critic_names[i]
        org = critic_orgs[i]
        country = critic_countries[i]

        #Each movie for the current critic
        for j in range(10):
            movie = movie_names[i * 10 + j]
            year = movie_years[i * 10 + j]
            director = directors[i * 10 + j]
            dob = dir_dob[i * 10 + j]
            pob = dir_pob[i * 10 + j]

            # Write a row to CSV 
            writer.writerow([critic, org, country, movie, year, director, dob, pob])

print("CSV file created successfully.")

CSV file created successfully.
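
The file written above has no header row, so a reader has to know the column order in advance. A small sketch of writing the header first and reading the rows back by name is shown below. The column names in `FIELDS` are assumptions, the birth columns are left blank in the sample row, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Assumed column names, in the same order as the row written above
FIELDS = ["critic", "org", "critic_country", "movie", "year",
          "director", "dob", "pob"]

sample_rows = [
    ["Simon Abrams", "Freelance film critic", "US",
     "Mulholland Drive", "2001", "David Lynch", "", ""],
]

buffer = io.StringIO()      # stand-in for open('movie_data.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(FIELDS)     # header row goes first
writer.writerows(sample_rows)

buffer.seek(0)
for record in csv.DictReader(buffer):
    print(record["critic"], record["movie"], record["year"])
```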


In [8]:
# none_count = 0
# with open('top_director_cob.csv', 'r', newline='', encoding='utf-8') as csvfile:
#     reader = csv.reader(csvfile)
#     for row in reader:
#         director_cob = row[0]  
#         if director_cob == "cob":
#            continue
#             #print (director_pob)
#         else:
#             print (director_cob)
#             keys = {'address': director_cob, 'key': api_key}
# #          print (keys)
#             r = requests.get(url,params=keys)
#             results_dic.append(r.json())

# #print (none_count)
# print (len(results_dic))
# # print (len(results_dic) + none_count)

Argentina
Australia
Austria
Belgium
Brazil
British Hong Kong
Canada
Chile
China
D.C.
Denmark
Ecuador
Empire of Japan
England
Finland
France
French Third Republic
French West Africa
Germany
Ghana
Greece
Holon
Hungary
Imperial State of Iran
India
Iran
Israel
Italy
Japan
Kenya
Kingdom of Italy
Kingdom of Portugal
Lithuania
Mauritania
Mexico
New Zealand
Northern Ireland
Pahlavi Iran
Palestine
Philippines
Poland
Portugal
Puerto Rico
Republic of China
Romania
Russian SFSR
SFR Yugoslavia
Scotland
Second Polish Republic
South Africa
South Korea
Soviet Union
Spain
Sweden
Syria
Taiwan
Tunisia
Turkey
U.K.
U.S.
US
United States
United StatesBrian TaylorUnited States
West Africa.
West Germany
Widnes
66
