## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The hardest part comes first: scraping the page on the BBC and, using Beautiful Soup and regular expressions, building a data set that will work.

Once you have the data set, you will be in good shape going forward--the goal after that will be to search for interesting patterns (top movies by country/critic/director/year)--this is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge that you'll be dealing with this week. The HTML page on the BBC site (mirrored on my site) poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes it a little bit harder, because there are not that many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for the critic, and you can isolate each group of top 10 movies as well. The harder part is using Beautiful Soup to find each critic together with the list of movies that immediately follows them. (Doing that is challenging. I have instructions on how to figure it out, but if you can't, just DM me on Slack and I will help you!)

Yes, that is how this process will work: below I have step-by-step instructions so you can try to write the code yourself. Do your best, and if you can't get there, Slack me and I will help you get your code working so you can move on to the next step.


### Getting started: Data Architecture

The central challenge of this project is figuring out how you are going to set up your table or tables from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: the main categories of analysis that are possible include movie, director, critic, critic's country, year, and whatever else you bring to this. Try to design a schema that will give you a table that you can run solid queries on. 

You will eventually want to bring this into pandas, so you want to keep your table as simple and structured as possible. Try to think about how you can transform the main source into one large table that can be aggregated and grouped.
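One possible shape for that table, offered only as a sketch: one row per critic-and-movie pick, so that every column holds a single value you can group on. The field names below (`critic`, `critic_country`, `rank`, and so on) are placeholders made up for illustration, not a required schema; the three rows are taken from the lists on the BBC page.

```python
from collections import Counter

# One row per (critic, movie) pick. Field names are illustrative placeholders.
picks = [
    {"critic": "Simon Abrams", "critic_country": "US", "rank": 1,
     "movie": "Mulholland Drive", "director": "David Lynch", "year": 2001},
    {"critic": "Simon Abrams", "critic_country": "US", "rank": 2,
     "movie": "In the Mood for Love", "director": "Wong Kar-wai", "year": 2000},
    {"critic": "Sam Adams", "critic_country": "US", "rank": 1,
     "movie": "In the Mood for Love", "director": "Wong Kar-wai", "year": 2000},
]

# Because each row is one pick, counting by any column is a one-liner.
mentions_by_movie = Counter(pick["movie"] for pick in picks)
mentions_by_year = Counter(pick["year"] for pick in picks)

print(mentions_by_movie.most_common(1))  # [('In the Mood for Love', 2)]
print(mentions_by_year[2000])            # 2
```

A "long" table like this repeats the critic's details on every row, which looks wasteful but is what makes grouping by movie, director, year, or country straightforward later.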

### Interpretive Architecture
**REMEMBER: secondary source** One of the steps this week is to find a source you can use to get the country of origin for each director. This is something you need to search for on your own. It will be hard to find a single page that lists every single director, but see what you can find. In the end, you don't have to have a complete database of every single director, but do your best to get as many as you can.

You don't necessarily have to go in the direction of directors' origin. You can certainly try to think of other categories of interpretation that you can join to this initial dataset. This is how you bring your point of view to a relatively large data set that seeks to frame the past 15 years of cinema. How can you bring a different point of view to this subject? You can certainly narrow your focus to a specific country, a group of countries, or a region. Either way, think about other data that might bring different types of insight to this list.
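As a rough illustration of what "joining" a secondary source can look like, here is a minimal sketch using a plain dictionary as the lookup. The `director_country` dictionary is a hypothetical stand-in for whatever source you find, and it is deliberately incomplete so you can see how the gaps show up.

```python
# Stand-in for your secondary source: director -> country.
# This tiny dict is a hypothetical example, deliberately incomplete.
director_country = {
    "David Lynch": "US",
    "Hayao Miyazaki": "Japan",
}

# (movie, director) pairs taken from the critics' lists.
picks = [
    ("Mulholland Drive", "David Lynch"),
    ("Spirited Away", "Hayao Miyazaki"),
    ("In the Mood for Love", "Wong Kar-wai"),
]

# A "left join": keep every pick, and use None where the lookup has no answer.
joined = [(movie, director, director_country.get(director))
          for movie, director in picks]

# Directors you still need to research.
missing = sorted({director for _, director, country in joined if country is None})
print(missing)  # ['Wong Kar-wai']
```

Keeping the unmatched rows (instead of dropping them) lets you see how complete your secondary source is before you draw conclusions from it.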

### Ready to code?

The first thing you need to do is import beautiful soup & requests like we did in the homework, and scrape the page. 

http://floatingmedia.com/columbia/BBC.html

Okay let's begin!

**STEP 1**


In [1]:
## Import your libraries: requests, re (for regular expressions), and Beautiful Soup
import requests
import re
from bs4 import BeautifulSoup

In [2]:
# Read the URL and keep the raw HTML of the page
my_url = "http://floatingmedia.com/columbia/BBC.html"
raw_html = requests.get(my_url).content

In [3]:
# Parse the raw HTML with Beautiful Soup, then print it so you can
# look for the tag that contains the entire list of critics and movies
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(soup_doc.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <title>
   BBC - Culture - The 21st Century’s 100 greatest films: Who voted?
  </title>
  <meta content="story, STORY, story, image, the-100-greatest-films-of-the-21st-century, " name="keywords"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." name="description"/>
  <meta content="The 21st Century’s 100 greatest films: Who voted?" property="og:title">
   <meta content="article" property="og:type">
    <meta content="http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted" property="og:url">
     <meta content="We polled 177 critics from around the world – here is how they voted." property="og:description">
      <meta content="summary_large_image" name="twitter:card"/>
      <meta content="@BBC_Culture" name="twitter:site"/>
      <meta content="The 21st Century’s 100 greatest fil

**STEP 2** Here is where it begins to get tricky: at this point everything we want is wrapped in `<p>` tags. Use Beautiful Soup's `find_all` to get a list of everything in a `<p>` tag. Make a variable that contains that list (you could call it `all_p` or something).


In [4]:
#find_all
all_p = soup_doc.find_all('p')
all_p

[<p style="position: absolute; top: -999em"><img alt="" height="1" src="//sa.bbc.co.uk/bbc/bbc/s?name=culture.story.20160819-the-21st-centurys-100-greatest-films-who-voted.page&amp;ml_name=webmodule&amp;ml_version=65&amp;blq_js_enabled=0&amp;blq_s=4d&amp;blq_r=2.7&amp;blq_v=default&amp;blq_e=pal&amp;pal_route=webserviceapi&amp;app_type=responsive&amp;language=en-GB&amp;pal_webapp=barlesque&amp;prod_name=frameworks&amp;app_name=frameworks" width="1"/></p>,
 <p class="introduction">We polled 177 critics from around the world – here is how they voted.</p>,
 <p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the U

**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the `<p>` elements (loop through the list you just got from `find_all()`) and pull out each critic and their list of movies.

Critics should not be too hard--every critic entry is embedded in `<strong>` tags. But in order to get the movies attached to that critic--you need to find the `<p>` tag immediately following each `<p><strong>` -- you can do this using next_sibling.

So, you need to build a loop that searches through your `all_p` list. If a line has a `<strong>` tag, then:

`critic_info = p_line.strong.string`

`movie_info = p_line.next_sibling`

As you go through this loop print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by movie line's HTML--you've got it!

I give you the beginning of the loop below, and then you can build it piece by piece. If you want to see the overall architecture of the final loop, I have a commented example at the end of the page--it might not be helpful to look at at this point. See how you do step-by-step and if you get stuck at a step Slack me with your code!
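One thing worth knowing before you start: `next_sibling` returns whatever node comes next, and that can be a piece of text (such as a newline) instead of a tag. The sketch below uses a made-up miniature of the page with a newline between the two paragraphs, and shows `find_next_sibling("p")` as an alternative that skips text nodes. Whether you need it depends on how the real page is written; if `next_sibling` already gives you the movie paragraph, you are fine.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the page, with a newline between the two paragraphs.
html = (
    "<div>\n"
    "<p><strong>Simon Abrams – Freelance film critic (US)</strong></p>\n"
    "<p>1. Mulholland Drive (David Lynch, 2001)<br/>"
    "2. In the Mood for Love (Wong Kar-wai, 2000)</p>\n"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

critic_p = soup.find("p")
# With whitespace in the markup, next_sibling is the newline text node, not a tag.
print(repr(critic_p.next_sibling))

pairs = []
for p_line in soup.find_all("p"):
    if p_line.strong is None:
        continue  # not a critic line
    movies_p = p_line.find_next_sibling("p")  # skips text nodes, returns the next <p>
    if movies_p is not None:
        pairs.append((p_line.strong.get_text(strip=True), movies_p))

print(pairs[0][0])
print(pairs[0][1].name)
```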



In [5]:
## Write your loop for STEP 3 here
# I started this for you. Because you only want to start from each critic,
# `if line.strong is not None:` keeps only the <p> lines that contain a <strong> tag.
for line in all_p:
    if line.strong is not None:
        critic_info = line.strong.string
        movie_info = line.next_sibling
        print(critic_info)
        movie_list = movie_info.find_all(string=True)
        for movie in movie_list:
            print(movie)

Simon Abrams – Freelance film critic (US)
1. Mulholland Drive (David Lynch, 2001)
2. In the Mood for Love (Wong Kar-wai, 2000)
3. The Tree of Life (Terrence Malick, 2011)
4. Yi Yi: A One and a Two (Edward Yang, 2000)
5. Goodbye to Language (Jean-Luc Godard, 2014)
6. The White Meadows (Mohammad Rasoulof, 2009)
7. Night Across the Street (Raoul Ruiz, 2012)
8. Certified Copy (Abbas Kiarostami, 2010)
9. Sparrow (Johnnie To, 2008)
10. Fados (Carlos Saura, 2007)
Sam Adams – Freelance film critic (US)
1. In the Mood for Love (Wong Kar-wai, 2000)
2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)
3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)
4. Spirited Away (Hayao Miyazaki, 2001)
5. The Act of Killing (Joshua Oppenheimer, 2012)
6. The Grand Budapest Hotel (Wes Anderson, 2014)
7. The New World (Terrence Malick, 2004)
8. Certified Copy (Abbas Kiarostami, 2010)
9. The World (Jia Zhangke, 2004)
10. Elephant (Gus Van Sant, 2003)
Thelma Adams – Freelance film critic (US)


**STEP 4**
If your loop is successfully isolating those two lines, it's time to parse each line with regular expressions. This needs to happen inside the loop: for every critic, and then (in STEP 5) for every movie. Here, just **focus on getting the critic's name, organization, and country.**

Inside the loop, once you have `critic_info`, make a regular expression that pulls out the name of the critic and store the result in a variable called `critic_name`:

`critic_name = re.findall(regex, critic_info)`

Do the same thing for critic_org and critic_cn

As you go, `print(critic_name)`, then `print(critic_org)`, etc., to make sure you're getting the results. It might help, before you put all these regular expressions in a loop, to grab just one critic's line and test your regular expressions on it to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.
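If you would rather test one pattern than three, here is a sketch of an alternative: a single regular expression with named groups that captures name, organization, and country in one pass. `CRITIC_LINE` and `parse_critic` are names invented for this example, and the pattern assumes every critic line looks like `Name – Organization (Country)`; lines that do not fit return `None` instead of raising an error.

```python
import re

# Assumed line format: "Name – Organization (Country)" (en dash between name and org).
CRITIC_LINE = re.compile(
    r"^(?P<name>.+?)\s+–\s+(?P<org>.+)\s+\((?P<country>[^()]+)\)\s*$"
)

def parse_critic(critic_info):
    """Return (name, org, country), or None when the line is not a critic entry."""
    match = CRITIC_LINE.match(critic_info)
    if match is None:
        return None
    return match.group("name"), match.group("org"), match.group("country")

print(parse_critic("Arturo Aguilar – Rolling Stone Mexico (Mexico)"))
# ('Arturo Aguilar', 'Rolling Stone Mexico', 'Mexico')
print(parse_critic("More on BBC Culture’s 100 greatest films of the 21st Century:"))
# None
```

Because the whitespace around the dash and the parentheses is matched outside the groups, the captured values come back without stray spaces.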

In [6]:
import re

In [7]:
# Practice/Build your regular expressions here
crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r"^[\s\w-]+ –"       # name at the start of the line; the match still ends with " –"
regex_for_org = r"(?<=\–).+?(?=\()"   # text between the en dash and the opening parenthesis
regex_for_cn = r"(?<=\().+?(?=\))"    # text inside the parentheses
name = re.findall(regex_for_name, crit_sample)[0]
org = re.findall(regex_for_org, crit_sample)[0]
cn = re.findall(regex_for_cn, crit_sample)[0]
print(name, org, cn)

Arturo Aguilar –  Rolling Stone Mexico  Mexico


In [8]:
# Take your working loop from step three
# and put it here with the regular expression parsing inside it.
# The second condition skips the "More on BBC Culture’s 100 greatest films"
# line, which sits in <strong> tags but is not a critic.

for line in all_p:
    if line.strong is not None and line.strong.string != "More on BBC Culture’s 100 greatest films of the 21st Century:":
        critic_info = line.strong.string
        name = re.findall(regex_for_name, critic_info)[0]
        org = re.findall(regex_for_org, critic_info)[0]
        cn = re.findall(regex_for_cn, critic_info)[0]
        print(name, org, cn)

Simon Abrams –  Freelance film critic  US
Sam Adams –  Freelance film critic  US
Thelma Adams –  Freelance film critic  US
Arturo Aguilar –  Rolling Stone Mexico  Mexico
Matthew Anderson –  BBC Culture  UK
Tim Appelo –  The Wrap  US
Adriano Aprà –  Film historian  Italy
Michael Arbeiter –  Nerdist  US
Ali Arikan –  Dipnot TV  Turkey
Michael Atkinson –  The Village Voice  US
Ana Maria Bahiana –  Freelance film critic  Brazil
Cameron Bailey –  Toronto Film Festival  Canada
Lindsay Baker –  BBC Culture  UK
Miriam Bale –  Freelance film critic  US
Nicholas Barber –  BBC Culture  UK
Diego Batlle –  La Nacion  Argentina
NT Binh –  Positif  France
Lizelle Bisschoff –  University of Glasgow  UK
Christian Blauvelt –  BBC Culture  US
Mahen Bonetti –  African Film Festival Inc  US
Andreas Borcholte –  Spiegel Online  Germany
Utpal Borpujari –  Freelance film critic  India
Richard Brody –  The New Yorker  US
Hannah Brown –  Jerusalem Post  Israel
Luke Buckmaster –  The Guardian/BBC Culture  Austra

**STEP 5**
Now you need to get your **movie names**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. I showed you this in class, but I'll just tell you again how to do this. To get a list of everything that is not a `<BR>` tag, use this method:

`each_movie = movie_info.find_all(string=True)`

This will give you a list called `each_movie`, which will contain one string for each movie, like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop that goes through `each_movie` and prints out each movie.
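A small variation you may find handy, sketched on a made-up fragment shaped like one critic's movie paragraph: `find_all(string=True)` returns every piece of text, including any newline-only pieces, while the tag's `stripped_strings` generator trims whitespace and drops the empty pieces for you. Whether the real page contains such stray newlines is something to check in your own output.

```python
from bs4 import BeautifulSoup

# Made-up fragment in the same shape as one critic's movie paragraph.
snippet = (
    "<p>1. Zero Dark Thirty (Kathryn Bigelow, 2012)<br/>\n"
    "2. A History of Violence (David Cronenberg, 2005)<br/>\n</p>"
)
movie_info = BeautifulSoup(snippet, "html.parser").p

raw = movie_info.find_all(string=True)          # every text node, newlines included
each_movie = list(movie_info.stripped_strings)  # trimmed, whitespace-only nodes dropped

for movie in each_movie:
    print(movie)
```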


In [9]:
## Take your working loop and add the find_all for each_movie,
# and the inner loop that loops through each_movie
for line in all_p:
    if line.strong is not None:
        movie_info = line.next_sibling
        movie_list = movie_info.find_all(string=True)
        for movie in movie_list:
            print(movie)

1. Mulholland Drive (David Lynch, 2001)
2. In the Mood for Love (Wong Kar-wai, 2000)
3. The Tree of Life (Terrence Malick, 2011)
4. Yi Yi: A One and a Two (Edward Yang, 2000)
5. Goodbye to Language (Jean-Luc Godard, 2014)
6. The White Meadows (Mohammad Rasoulof, 2009)
7. Night Across the Street (Raoul Ruiz, 2012)
8. Certified Copy (Abbas Kiarostami, 2010)
9. Sparrow (Johnnie To, 2008)
10. Fados (Carlos Saura, 2007)
1. In the Mood for Love (Wong Kar-wai, 2000)
2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)
3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)
4. Spirited Away (Hayao Miyazaki, 2001)
5. The Act of Killing (Joshua Oppenheimer, 2012)
6. The Grand Budapest Hotel (Wes Anderson, 2014)
7. The New World (Terrence Malick, 2004)
8. Certified Copy (Abbas Kiarostami, 2010)
9. The World (Jia Zhangke, 2004)
10. Elephant (Gus Van Sant, 2003)
1. Zero Dark Thirty (Kathryn Bigelow, 2012)
2. A History of Violence (David Cronenberg, 2005)
3. The Grand Budapest Hotel (

6. Caché (Michael Haneke, 2005)
7. Mulholland Drive (David Lynch, 2001)
8. The Congress (Ari Folman, 2013)
9. Sympathy for Mr Vengeance (Park Chan-wook, 2002)
10. Synecdoche, New York (Charlie Kaufman, 2008)
1. Yi Yi: A One and a Two (Edward Yang, 2000)
2. There Will Be Blood (Paul Thomas Anderson, 2007)
3. Boyhood (Richard Linklater, 2014)
4. Still Life (Jia Zhangke, 2006)
5. Archipelago (Joanna Hogg, 2010)
6. The Headless Woman (Lucrecia Martel, 2008)
7. The Act of Killing (Joshua Oppenheimer, 2012)
8. Caché (Michael Haneke, 2005)
9. Divine Intervention (Elia Suleiman, 2002)
10. Crimson Gold (Jafar Panahi, 2003)
1. Zodiac (David Fincher, 2007)
2. Inside Llewyn Davis (Joel and Ethan Coen, 2013)
3. There Will Be Blood (Paul Thomas Anderson, 2007)
4. Spider-Man 2 (Sam Raimi, 2004)
5. Oldboy (Park Chan-wook, 2003)
6. Inglourious Basterds (Quentin Tarantino, 2009)
7. Frozen (Chris Buck and Jennifer Lee, 2013)
8. 25th Hour (Spike Lee, 2002)
9. Requiem for a Dream (Darren Aronofsky, 2000)
1

Now that you have that loop working, you need to use regular expressions to pull out the name of the movie. First, practice on a sample line until your regular expression returns the movie name.


In [10]:
# Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"  # commas in the title: a good second test
regex_for_mname = r"^[^(]*"  # everything before the first "(": keeps the rank and a trailing space
movie_name = re.findall(regex_for_mname, movie_sample)
movie_name[0]

'1. Zero Dark Thirty '

In [11]:

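# Take the loop from In [9] and apply regex_for_mname to each movie.
# Caution: critic_info is never reassigned inside this loop, so print(critic_info)
# repeats whatever value the earlier loop left behind (hence the same critic
# above every list in the output). Add critic_info = line.strong.string
# inside the if block to print each list's own critic.
for line in all_p:
    if line.strong is not None:
        movie_info = line.next_sibling
        movie_list = movie_info.find_all(string=True)
        print(critic_info)
        for movie in movie_list:
            movie_name = re.findall(regex_for_mname, movie)[0]
            print(movie_name)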

Raymond Zhou – China Daily (China)
1. Mulholland Drive 
2. In the Mood for Love 
3. The Tree of Life 
4. Yi Yi: A One and a Two 
5. Goodbye to Language 
6. The White Meadows 
7. Night Across the Street 
8. Certified Copy 
9. Sparrow 
10. Fados 
Raymond Zhou – China Daily (China)
1. In the Mood for Love 
2. Eternal Sunshine of the Spotless Mind 
3. Syndromes and a Century 
4. Spirited Away 
5. The Act of Killing 
6. The Grand Budapest Hotel 
7. The New World 
8. Certified Copy 
9. The World 
10. Elephant 
Raymond Zhou – China Daily (China)
1. Zero Dark Thirty 
2. A History of Violence 
3. The Grand Budapest Hotel 
4. Stories We Tell 
5. Casino Royale 
6. Eternal Sunshine of the Spotless Mind 
7. Tabu 
8. Snow White 
9. Frozen River 
10. Gosford Park 
Raymond Zhou – China Daily (China)
1. In the Mood for Love 
2. Mulholland Drive 
3. Inception 
4. Pan's Labyrinth 
5. Caché 
6. Grizzly Man 
7. 4 Months, 3 Weeks & 2 Days 
8. Holy Motors 
9. The Last of the Unjust 
10. There Will Be Blood 


6. Mulholland Drive 
7. Import Export 
8. Son of Saul 
9. Kill Bill: Vol. 1 
10. The Revenant 
Raymond Zhou – China Daily (China)
1. Tropical Malady 
2. Mulholland Drive 
3. The Turin Horse 
4. Tie Xi Qu: West of the Tracks 
5. Le filmeur 
6. Holy Motors 
7. Elephant 
8. A Touch of Sin 
9. Pan's Labyrinth 
10. Spirited Away 
Raymond Zhou – China Daily (China)
1. The New World 
2. Capitalism: Child Labor 
3. Psalm III: 'Night of the Meek' 
4. Goodbye to Language 
5. Daylight Moon 
6. Oldboy 
7. A Commuter’s Life 
8. The Fourth Watch 
9. Our Daily Bread 
10. The World 
Raymond Zhou – China Daily (China)
1. Ankhon Dekhi 
2. Court 
3. LSD: Love, Sex Aur Dhokha 
4. Monsoon Wedding 
5. Dev D 
6. Paan Singh Tomar 
7. Udaan 
8. Hazaaron Khwaishein Aisi 
9. Maqbool 
10. Lagaan: Once Upon a Time in India 
Raymond Zhou – China Daily (China)
1. There Will Be Blood 
2. Mulholland Drive 
3. AI: Artificial Intelligence 
4. Blue Is the Warmest Color 
5. Spirited Away 
6. Once Upon a Time in Anatolia 


Raymond Zhou – China Daily (China)
1. Holy Motors 
2. Lifeline 
3. Certified Copy 
4. Kung Fu Hustle 
5. Instructions for a Light and Sound Machine 
6. Un lac 
7. Detention 
8. A Vingança de Uma Mulher 
9. Mia Madre 
10. Femme Fatale 
Raymond Zhou – China Daily (China)
1. The Wolf of Wall Street 
2. No Country For Old Men 
3. Spirited Away 
4. Million Dollar Baby 
5. The Ghost Writer 
6. The Son's Room 
7. A History of Violence 
8. Talk to Her 
9. Before the Devil Knows You're Dead 
10. Match Point 
Raymond Zhou – China Daily (China)
1. Carlos 
2. Zodiac 
3. Leviathan 
4. Mulholland Drive 
5. The Assassination of Jesse James by the Coward Robert Ford 
6. The Incredibles 
7. Children of Men 
8. Fantastic Mr Fox 
9. Grizzly Man 
10. Brokeback Mountain 
Raymond Zhou – China Daily (China)
1. You Ain’t Seen Nothin’ Yet 
2. No Home Movie 
3. The Romance of Astrea and Celadon 
4. The Strange Case of Angelica 
5. Michelangelo Eye to Eye 
6. Warming by the Devil's Fire 
7. The New World 
8. Caf

**STEP 6**
You're almost there!!! Now that you have a working regular expression put that in your inner loop to get the movie name.

So now the entire loop should be getting you three fields for each critic:
- critic_name
- critic_org
- critic_cn

And an inner loop that will run 10 times (for the 10 movies) and give you 10 instances of:
- rank (this is actually optional, but maybe helpful to keep)
- movie_name
- director
- year

Build this loop using print() on the first one or two critic selections, just to make sure you are pulling out the right data.
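Movie lines are harder than critic lines because titles can contain commas ("4 Months, 3 Weeks & 2 Days") and some films credit several directors. One way to cope, sketched here under the assumption that every line looks like `Rank. Title (Director, Year)`, is a single pattern that anchors on the four-digit year at the end of the line. `MOVIE_LINE` and `parse_movie` are names invented for this example.

```python
import re

# Assumed line format: "Rank. Title (Director, Year)".
MOVIE_LINE = re.compile(
    r"""
    ^\s*(?P<rank>\d+)\.\s+      # rank, e.g. 7.
    (?P<title>.+)\s+            # title; greedy, so commas inside it are kept
    \((?P<director>.+),\s*      # director credit, up to the last comma
    (?P<year>\d{4})\)\s*$       # four-digit year before the closing parenthesis
    """,
    re.VERBOSE,
)

def parse_movie(line):
    """Return a dict with rank, title, director, year, or None if the line does not fit."""
    m = MOVIE_LINE.match(line)
    if m is None:
        return None
    return {
        "rank": int(m.group("rank")),
        "title": m.group("title"),
        "director": m.group("director"),
        "year": int(m.group("year")),
    }

print(parse_movie("7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"))
```

Anchoring on the end of the line means the year is always the last four digits before the closing parenthesis, no matter how many commas appear earlier in the title or the director credit.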




In [12]:
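# Critic line format: "Name – Organization (Country)"
regex_for_name = r"^[^–]*"                # everything before the en dash
regex_for_org = r"(?<=\–).+?(?=\()"       # between the en dash and "("
regex_for_cn = r"(?<=\().+?(?=\))"        # inside the parentheses
# Movie line format: "Rank. Title (Director, Year)"
regex_for_mname = r"^[^(]*"               # rank and title, up to the first "("
regex_for_director = r"(?<=\().+?(?=\,)"  # from "(" to the first comma after it
regex_for_myear = r"(?<=\,).+?(?=\))"     # from the first comma in the line to ")"
# Caution: regex_for_myear starts at the FIRST comma, so a title that contains
# a comma ("Synecdoche, New York") drags part of the title into the year, and
# a credit with several comma-separated directors keeps only the first name.
# `if line.strong:` also lets through any <strong> line that is not a critic;
# when a pattern finds nothing there, findall(...)[0] raises IndexError.

for line in all_p:
    if line.strong:
        critic_info = line.strong.string
        name = re.findall(regex_for_name, critic_info)[0]
        org = re.findall(regex_for_org, critic_info)[0]
        cn = re.findall(regex_for_cn, critic_info)[0]

        movie_info = line.next_sibling
        movie_list = movie_info.find_all(string=True)
        print(name, org, cn)
        for movie in movie_list:
            movie_name = re.findall(regex_for_mname, movie)[0]
            movie_director = re.findall(regex_for_director, movie)[0]
            movie_year = re.findall(regex_for_myear, movie)[0]
            print(movie_name)
            print(movie_director)
            print(movie_year)
            print("-------")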

Simon Abrams   Freelance film critic  US
1. Mulholland Drive 
David Lynch
 2001
-------
2. In the Mood for Love 
Wong Kar-wai
 2000
-------
3. The Tree of Life 
Terrence Malick
 2011
-------
4. Yi Yi: A One and a Two 
Edward Yang
 2000
-------
5. Goodbye to Language 
Jean-Luc Godard
 2014
-------
6. The White Meadows 
Mohammad Rasoulof
 2009
-------
7. Night Across the Street 
Raoul Ruiz
 2012
-------
8. Certified Copy 
Abbas Kiarostami
 2010
-------
9. Sparrow 
Johnnie To
 2008
-------
10. Fados 
Carlos Saura
 2007
-------
Sam Adams   Freelance film critic  US
1. In the Mood for Love 
Wong Kar-wai
 2000
-------
2. Eternal Sunshine of the Spotless Mind 
Michel Gondry
 2004
-------
3. Syndromes and a Century 
Apichatpong Weerasethakul
 2006
-------
4. Spirited Away 
Hayao Miyazaki
 2001
-------
5. The Act of Killing 
Joshua Oppenheimer
 2012
-------
6. The Grand Budapest Hotel 
Wes Anderson
 2014
-------
7. The New World 
Terrence Malick
 2004
-------
8. Certified Copy 
Abbas Kiarostami

2. Talk to Her 
Pedro Almodóvar
 2002
-------
3. The Assassin 
Hou Hsiao-hsien
 2015
-------
4. Million Dollar Baby 
Clint Eastwood
 2004
-------
5. In the Mood for Love 
Wong Kar-wai
 2000
-------
6. Leviathan 
Andrey Zvyagintsev
 2014
-------
7. A Touch of Sin 
Jia Zhangke
 2013
-------
8. Distant 
Nuri Bilge Ceylan
 2002
-------
9. The New World 
Terrence Malick
 2005
-------
10. A Prophet 
Jacques Audiard
 2009
-------
Lizelle Bisschoff   University of Glasgow  UK
1. Timbuktu 
Abderrahmane Sissako
 2014
-------
2. Blue Is the Warmest Color 
Abdellatif Kechiche
 2013
-------
3. Moolaadé 
Ousmane Sembèène
 2004
-------
4. A Separation 
Asghar Farhadi
 2011
-------
5. The Secret in Their Eyes 
Juan José Campanella
 2009
-------
6. Oldboy 
Park Chan-wook
 2003
-------
7. In the Mood for Love 
Wong Kar-wai
 2000
-------
8. Fat Girl 
Catherine Breillat
 2001
-------
9. The Orphanage 
J. A. Bayona
 2007
-------
10. The Hunt 
Thomas Vinterberg
 2012
-------
Christian Blauvelt   BBC Culture

 2002
-------
5. Something Necessary 
Judy Kibinge
 2013
-------
6. Moolaadé 
Ousmane Sembène
 2004
-------
7. Sembène! 
Samba Gadjigo and Jason Silverman
 2015
-------
8. 5 Broken Cameras 
Emad Burnat and Guy Davidi
 2011
-------
9. Hooligan Sparrow 
Nanfu Wang
 2016
-------
10. 7 Letters 
Boo Junfeng
 Eric Khoo, Jack Neo, K. Rajagopal, Tan Pin Pin, Royston Tan and Kelvin Tong, 2015
-------
Alonso Duralde   TheWrap  US
1. Synecdoche, New York 
Charlie Kaufman
 New York (Charlie Kaufman, 2008
-------
2. Brokeback Mountain 
Ang Lee
 2005
-------
3. Weekend 
Andrew Haigh
 2011
-------
4. 4 Months, 3 Weeks & 2 Days 
Cristian Mungiu
 3 Weeks & 2 Days (Cristian Mungiu, 2007
-------
5. Spirited Away 
Hayao Miyazaki
 2001
-------
6. How to Survive a Plague 
David France
 2012
-------
7. Talk to Her 
Pedro Almodóvar
 2002
-------
8. American Splendor 
Robert Pulcini and Shari Springer Berman
 2003
-------
9. In the Mood for Love 
Wong Kar-wai
 2000
-------
10. Far From Heaven 
Todd Haynes
 200

 2002
-------
5. 35 Shots of Rum 
Claire Denis
 2008
-------
6. Closed Curtain 
Jafar Panahi
 2013
-------
7. Tangerine 
Sean Baker
 2015
-------
8. The Royal Tenenbaums 
Wes Anderson
 2001
-------
9. Spring Breakers 
Harmony Korine
 2012
-------
10. Uncle Boonmee Who Can Recall His Past Lives 
Apichatpong Weerasethakul
 2010
-------
Shiguehiko Hasumi   University of Tokyo  Japan
1. Notre musique 
Jean-Luc Godard
 2004
-------
2. Triple Agent 
Éric Rohmer
 2004
-------
3. The Assassin 
Hou Hsiao-hsien
 2015
-------
4. Gran Torino 
Clint Eastwood
 2008
-------
5. Horse Money 
Pedro Costa
 2014
-------
6. Fantastic Mr Fox 
Wes Anderson
 2012
-------
7. Holy Motors 
Leos Carax
 2012
-------
8. Death Proof – Grindhouse 
Quentin Tarantino
 2007
-------
9. Déjà Vu 
Tony Scott
 2006
-------
10. Seventh Code 
Kiyoshi Kurosawa
 2013
-------
Katarina Hedrén   Freelance film critic  South Africa
1. Waiting for Happiness 
Abderrahmane Sissako
 2002
-------
2. The White Ribbon 
Michael Haneke
 2009

-------
10. 25th Hour 
Spike Lee
 2002
-------
Andreas Kilb   Frankfurter Allgemeine Zeitung  Germany
1. In the Mood for Love 
Wong Kar-wai
 2000
-------
2. Dogville 
Lars von Trier
 2003
-------
3. Ten 
Abbas Kiarostami
 2002
-------
4. The White Ribbon 
Michael Haneke
 2009
-------
5. The Circle 
Jafar Panahi
 2000
-------
6. Bad Education 
Pedro Almodóvar
 2005
-------
7. Rust and Bone 
Jacques Audiard
 2012
-------
8. 5x2 
François Ozon
 2004
-------
9. Russian Ark 
Aleksandr Sokurov
 2002
-------
10. Far From Heaven 
Todd Haynes
 2002
-------
Uri Klein   Haaretz  Israel
1. The Last of the Unjust 
Claude Lanzmann
 2013
-------
2. Once Upon a Time in Anatolia 
Nuri Bilge Ceylan
 2011
-------
3. 4 Months, 3 Weeks & 2 Days 
Cristian Mungiu
 3 Weeks & 2 Days (Cristian Mungiu, 2007
-------
4. Police, Adjective 
Corneliu Porumboiu
 Adjective (Corneliu Porumboiu, 2009
-------
5. Far From Heaven 
Todd Haynes
 2002
-------
6. The Pianist 
Roman Polanski
 2002
-------
7. Winter Sleep 
Nuri B

 2004
-------
2. Crouching Tiger, Hidden Dragon 
Ang Lee
 Hidden Dragon (Ang Lee, 2000
-------
3. Holy Motors 
Leos Carax
 2012
-------
4. City of God 
Fernando Meirelles and Kátia Lund
 2002
-------
5. Enter the Void 
Gaspar Noé
 2009
-------
6. Hedwig and the Angry Inch 
John Cameron Mitchell
 2001
-------
7. Melancholia 
Lars von Trier
 2011
-------
8. Mad Max: Fury Road 
George Miller
 2015
-------
9. The Lord of the Rings: The Fellowship of the Ring 
Peter Jackson
 2001
-------
10. Grizzly Man 
Werner Herzog
 2005
-------
Kim Morgan   Sight & Sound/Criterion  US
1. Inherent Vice 
Paul Thomas Anderson
 2014
-------
2. Mulholland Drive 
David Lynch
 2001
-------
3. Melancholia 
Lars von Trier
 2011
-------
4. There Will Be Blood 
Paul Thomas Anderson
 2007
-------
5. Under the Skin 
Jonathan Glazer
 2013
-------
6. Fat Girl 
Catherine Breillat
 2001
-------
7. A Serious Man 
Joel and Ethan Coen
 2009
-------
8. Battle Royale 
Kinji Fukasaku
 2000
-------
9. The Turin Horse 
Béla Tar

Denis Villeneuve
 2010
-------
Tim Robey   The Daily Telegraph  UK
1. Mulholland Drive 
David Lynch
 2001
-------
2. Synecdoche, New York 
Charlie Kaufman
 New York (Charlie Kaufman, 2008
-------
3. Birth 
Jonathan Glazer
 2004
-------
4. Elena 
Andrey Zvyagintsev
 2011
-------
5. Carol 
Todd Haynes
 2015
-------
6. Tabu 
Miguel Gomes
 2012
-------
7. Master and Commander: The Far Side of the World 
Peter Weir
 2003
-------
8. Margaret 
Kenneth Lonergan
 2011
-------
9. There Will Be Blood 
Paul Thomas Anderson
 2007
-------
10. 12 Years a Slave 
Steve McQueen
 2013
-------
Tasha Robinson   The Verge  US
1. 25th Hour 
Spike Lee
 2002
-------
2. City of God 
Fernando Meirelles and Kátia Lund
 2002
-------
3. The Act of Killing 
Joshua Oppenheimer
 2012
-------
4. The Prestige 
Christopher Nolan
 2006
-------
5. Spirited Away 
Hayao Miyazaki
 2001
-------
6. The Incredibles 
Brad Bird
 2004
-------
7. Gosford Park 
Robert Altman
 2001
-------
8. Memento 
Christopher Nolan
 2000
-------
9

-------
2. Distant 
Nuri Bilge Ceylan
 2002
-------
3. A Separation 
Asghar Farhadi
 2011
-------
4. Samson & Delilah 
Warwick Thornton
 2009
-------
5. Leviathan 
Andrey Zvyagintsev
 2014
-------
6. Still Walking 
Hirokazu Koreeda
 2008
-------
7. Talk to Her 
Pedro Almodóvar
 2002
-------
8. Million Dollar Baby 
Clint Eastwood
 2004
-------
9. No Country For Old Men 
Joel and Ethan Coen
 2007
-------
10. The Man Without A Past 
Aki Kaurismäki
 2002
-------
Cédric Succivalli   International Cinephile Society  Italy
1. Mysteries of Lisbon 
Raoul Ruiz
 2010
-------
2. Margaret 
Kenneth Lonergan
 2011
-------
3. The New World 
Terrence Malick
 2005
-------
4. Secret Things 
Jean-Claude Brisseau
 2002
-------
5. La Ciénaga 
Lucrecia Martel
 2001
-------
6. Toni Erdmann 
Maren Ade
 2016
-------
7. In the Family 
Patrick Wang
 2011
-------
8. Tabu 
Miguel Gomes
 2012
-------
9. Gerry 
Gus Van Sant
 2002
-------
10. Tropical Malady 
Apichatpong Weerasethakul
 2004
-------
Alin Tasciyan   Sta

IndexError: list index out of range
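An `IndexError: list index out of range` like this one is what `re.findall(...)[0]` produces when the pattern finds nothing: `findall` returns an empty list, and an empty list has no item 0. A small helper can turn that crash into a value you can test for. `first_match` is a name invented for this sketch, not something from `re`.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first match of pattern in text, stripped, or default if none.

    Assumes the pattern has at most one capturing group, so findall
    returns plain strings rather than tuples.
    """
    matches = re.findall(pattern, text)
    return matches[0].strip() if matches else default

regex_for_cn = r"(?<=\().+?(?=\))"

print(first_match(regex_for_cn, "Arturo Aguilar – Rolling Stone Mexico (Mexico)"))
# Mexico
print(first_match(regex_for_cn, "More on BBC Culture’s 100 greatest films of the 21st Century:"))
# None
```

Inside the loop you could then skip a line whenever one of the critic fields comes back as `None`, instead of letting the whole cell stop.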

**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this let me know and we can discuss what to do next. If you've made it just following instructions, you are in great shape for the rest of this project--if not, don't worry! I will get you through by midweek.

The final step is building a list of lists of all this information.

So you need to have a loop that gets everything out, but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?




In [19]:
# Build a list of lists: one row per critic/movie pick.
# Loop through the Beautiful Soup elements and use the regexes
# you developed above to get each unit of info.

regex_for_name = r"^[^–]*"
regex_for_org = r"(?<=\–).+?(?=\()"
regex_for_cn = r"(?<=\().+?(?=\))"
regex_for_mname = r"^[^(]*"
regex_for_director = r"(?<=\().+?(?=\,)"
regex_for_myear = r"(?<=\,).+?(?=\))"

# This <strong> line is not a critic and does not follow the
# "Name – Org (Country)" pattern, so findall(...)[0] on it raises IndexError.
not_a_critic = "More on BBC Culture’s 100 greatest films of the 21st Century:"

# Create an empty list to collect the rows
list_all = []
for line in all_p:
    if line.strong is not None and line.strong.string != not_a_critic:
        critic_info = line.strong.string
        name = re.findall(regex_for_name, critic_info)[0]
        org = re.findall(regex_for_org, critic_info)[0]
        cn = re.findall(regex_for_cn, critic_info)[0]
        movie_info = line.next_sibling
        movie_list = movie_info.find_all(string=True)

        for movie in movie_list:
            movie_name = re.findall(regex_for_mname, movie)[0]
            movie_director = re.findall(regex_for_director, movie)[0]
            movie_year = re.findall(regex_for_myear, movie)[0]
            list_all.append([name, org, cn, movie_name, movie_director, movie_year])

# Do not print(list_all) inside the loop: it reprints the whole growing list
# on every pass and trips the notebook's "IOPub data rate exceeded" limit.

In [14]:
## Take a peek at your final list of lists
list_all

[['Simon Abrams ',
  ' Freelance film critic ',
  'US',
  '1. Mulholland Drive ',
  'David Lynch',
  ' 2001'],
 ['Simon Abrams ',
  ' Freelance film critic ',
  'US',
  '2. In the Mood for Love ',
  'Wong Kar-wai',
  ' 2000'],
 ['Simon Abrams ',
  ' Freelance film critic ',
  'US',
  '3. The Tree of Life ',
  'Terrence Malick',
  ' 2011'],
 ['Simon Abrams ',
  ' Freelance film critic ',
  'US',
  '4. Yi Yi: A One and a Two ',
  'Edward Yang',
  ' 2000'],
 ['Simon Abrams ',
  ' Freelance film critic ',
  'US',
  '5. Goodbye to Language ',
  'Jean-Luc Godard',
  ' 2014'],
 ['Simon Abrams ',
  ' Freelance film critic ',
  'US',
  '6. The White Meadows ',
  'Mohammad Rasoulof',
  ' 2009'],
 ['Simon Abrams ',
  ' Freelance film critic ',
  'US',
  '7. Night Across the Street ',
  'Raoul Ruiz',
  ' 2012'],
 ['Simon Abrams ',
  ' Freelance film critic ',
  'US',
  '8. Certified Copy ',
  'Abbas Kiarostami',
  ' 2010'],
 ['Simon Abrams ',
  ' Freelance film critic ',
  'US',
  '9. Sparrow ',
 

In [15]:
len(list_all)

1770

If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own--they will use similar logic, but they will not be nearly as complicated as this one.
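Before loading the rows into pandas, it may help to tidy them: the values above still carry leading and trailing spaces, the rank is glued to the movie title, and the year is a string. The sketch below shows one way to clean a single row; `clean_row` and the dictionary keys are invented for this example. It can rescue a year that picked up stray title text, but it cannot restore co-directors that the director regex already cut off.

```python
import re

def clean_row(row):
    """Tidy one raw row: [name, org, cn, movie_name, movie_director, movie_year]."""
    name, org, cn, movie_name, director, year = (field.strip() for field in row)

    # "1. Mulholland Drive" -> rank 1, title "Mulholland Drive"
    rank_match = re.match(r"(\d+)\.\s+(.*)", movie_name)
    if rank_match:
        rank, title = int(rank_match.group(1)), rank_match.group(2)
    else:
        rank, title = None, movie_name

    # Keep only the trailing four digits of the year field.
    year_match = re.search(r"(\d{4})$", year)

    return {
        "critic": name, "org": org, "country": cn,
        "rank": rank, "movie": title, "director": director,
        "year": int(year_match.group(1)) if year_match else None,
    }

raw = ['Simon Abrams ', ' Freelance film critic ', 'US',
       '1. Mulholland Drive ', 'David Lynch', ' 2001']
print(clean_row(raw))
```

Applied to every row with a list comprehension, this gives you a list of dictionaries that pandas can turn into a table directly.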

In [20]:
import pandas as pd
import numpy as np



In [21]:
# col_names = ['movie', 'director','m_year','crit_rank', 'critic','crit_org','crit_cn']
col_names = ['name', 'org', 'cn', 'movie_name', 'movie_director', 'movie_year']
df = pd.DataFrame.from_records(list_all, columns = col_names)

In [22]:
df.head()

|   | name | org | cn | movie_name | movie_director | movie_year |
|---|------|-----|----|------------|----------------|------------|
| 0 | Simon Abrams | Freelance film critic | US | 1. Mulholland Drive | David Lynch | 2001 |
| 1 | Simon Abrams | Freelance film critic | US | 2. In the Mood for Love | Wong Kar-wai | 2000 |
| 2 | Simon Abrams | Freelance film critic | US | 3. The Tree of Life | Terrence Malick | 2011 |
| 3 | Simon Abrams | Freelance film critic | US | 4. Yi Yi: A One and a Two | Edward Yang | 2000 |
| 4 | Simon Abrams | Freelance film critic | US | 5. Goodbye to Language | Jean-Luc Godard | 2014 |


In [23]:
df.to_csv('project_bbc.csv', index=False)

In [24]:
pd.read_csv('project_bbc.csv').head()

|   | name | org | cn | movie_name | movie_director | movie_year |
|---|------|-----|----|------------|----------------|------------|
| 0 | Simon Abrams | Freelance film critic | US | 1. Mulholland Drive | David Lynch | 2001 |
| 1 | Simon Abrams | Freelance film critic | US | 2. In the Mood for Love | Wong Kar-wai | 2000 |
| 2 | Simon Abrams | Freelance film critic | US | 3. The Tree of Life | Terrence Malick | 2011 |
| 3 | Simon Abrams | Freelance film critic | US | 4. Yi Yi: A One and a Two | Edward Yang | 2000 |
| 4 | Simon Abrams | Freelance film critic | US | 5. Goodbye to Language | Jean-Luc Godard | 2014 |
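Once the table is in pandas, the pattern-hunting described at the start of the project mostly comes down to counting and grouping. Here is a minimal sketch on a handful of rows from the critics' lists (already stripped of ranks and extra spaces); `sample_df` and the two questions asked of it are only examples of the kind of query you might run on your full table.

```python
import pandas as pd

col_names = ['name', 'org', 'cn', 'movie_name', 'movie_director', 'movie_year']
# A handful of rows from the critics' lists, already stripped of rank and whitespace.
sample_rows = [
    ['Simon Abrams', 'Freelance film critic', 'US', 'In the Mood for Love', 'Wong Kar-wai', 2000],
    ['Simon Abrams', 'Freelance film critic', 'US', 'The Tree of Life', 'Terrence Malick', 2011],
    ['Sam Adams', 'Freelance film critic', 'US', 'In the Mood for Love', 'Wong Kar-wai', 2000],
    ['Sam Adams', 'Freelance film critic', 'US', 'The New World', 'Terrence Malick', 2004],
    ['Sam Adams', 'Freelance film critic', 'US', 'Spirited Away', 'Hayao Miyazaki', 2001],
]
sample_df = pd.DataFrame(sample_rows, columns=col_names)

# How many lists mention each movie?
movie_counts = sample_df['movie_name'].value_counts()

# How many different critics picked at least one film by each director?
critics_per_director = (
    sample_df.groupby('movie_director')['name']
    .nunique()
    .sort_values(ascending=False)
)

print(movie_counts.head(3))
print(critics_per_director)
```

The same two lines work unchanged on the full table, and swapping the column names gives you the equivalent counts by year or by critic country.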
