## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The most difficult aspect of this project is the first: scraping the page on the BBC and, using beautiful soup and regular expressions, building a data set that will work.

Once you have the data set, you will be in good shape going forward--the goal after that will be to search for interesting patterns (top movies by country/critic/director/year)--this is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge that you'll be dealing with through Wednesday of this week. The HTML page on the BBC site poses a number of challenges. While the layout is relatively simple and consistent--the simplicity actually makes it a little bit harder, because there's not that many HTML tags to help you isolate each unit of data--you can use beautiful soup to isolate the line that contains all the information for the critic, and you can isolate each group of top 10 movies as well. You need to, and this is a bit harder, use beautiful soup find the critic--as well as the list of movies then immediately follow her/him. (Using beautiful soup to do that is challenging--I have instructions on how to figure it out, but if you can't figure it out--just email me and I will send you the code.)

Yes, that is how this process will work--below I have step-by-step instructions so you can try to write the code yourself. Do your best--and if you can't get there, email me and I will send you working code so you can move on to the next step.


### Getting started: Data Architecture
You can come up with your own data scheme for this, but the one I'm recommending is three separate lists:

The central challenge of this project it's figuring out how you are going to set up your table or tables from this long list of critics and movies. What will each row be? What will the columns be and each row? How can you set it up so that you have the most useful table possible. 

Some things to think about: the main categories of analysis that are possible include movie, director, critic, critic's country, year, and whatever else you bring to this. Try to design a schema that will give you a table that you can run solid queries on. 

For this project, if you're interested in recalling your knowledge of SQL, you can do the additional step of entering your transformed data into postgres. Or you can just stick with pandas.

### Interpretive Architecture
**REMEMBER: secondary source** Part of the steps this week, is to find a source you can use to get the country of origin for each director. This is something you need to search for on your own--it will be hard for you to find a single page that has a list of every single director. But see what you can find. In the end, you don't have to have a complete database of every single director, but do your best to get as many as you can.

You don't necessarily have to go in the direction of directors' origin. You can certainly try to think of other categories of interpretation that you can join to this initial dataset. This is how you bring your point-of-view to a relatively large data set that seeks to frame the past 15 years of cinema. How can you bring a different point-of-view to this subject? You can certainly narrow your focus to a specific country, the group of countries, or a region. Either way, think about other data that might bring different types of insight to this list.

### Ready to code?

The first thing you need to do is import beautiful soup & requestions like we did in the homework, and scrape the page. http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted


One thing I should note there are two inconsistencies (actual errors in the HTML) that will cause you to lose a couple entries (which is okay but may be frustrating). I have posted a version of the exact same page with those inconsistencies fixed, if you want to scrape from that page: 

http://floatingmedia.com/columbia/BBC.html

It's up to you. Okay let's begin!

STEP 1:


In [1]:
##Import your libraries: Beautiful soup, urllib, and re (For regular expressions)
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
# read the URL, and put the HTML page into beautiful soup
my_url = "http://floatingmedia.com/columbia/BBC.html"
raw_html = requests.get(my_url).content
#soup_doc = BeautifulSoup(raw_html, "html.parser")
#print(raw_html)

In [3]:

soup_doc = BeautifulSoup(raw_html, "html.parser")

In [4]:
#Using beautiful soup find the div tag that contains 
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information 

all_info = soup_doc.find(class_="body-content")
print(all_info)             
              


<div class="body-content">
<p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.</p><p><strong>Simon Abrams – Freelance film critic (US)</strong></p><p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Ra

**STEP 2** Here is where it begins to get tricky: obviously at this point everything we want is surrounded in `<p>` tags. Use a beautiful soup find_all to get a list of every thing in `<p>` tag. Make a variable that contains that list (you could call it all_p or something)


In [5]:
#find_all
all_p = all_info.find_all('p')

In [8]:
all_p

[<p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.</p>,
 <p><strong>Simon Abrams – Freelance film critic (US)</strong></p>,
 <p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Rasoulof, 2009)<br/>7.

**STEP THREE** This is where all the magic has to happen: you need to find a way to loop through all of the `<p>` elements (loop through the list you just got from the find_all()) and pullout critics, and list of movies. 

Critics should not be too hard--every critic entry is embedded in `<strong>` tags. But in order to get the movies attached to that critic--you need to find the `<p>` tag immediately following each `<p><strong>` -- you can do this using next_sibling.

So, you need to build a loop that searches to your `all_p` list:

if it has a `<strong>` tag then 
critic_info = p_line.strong.string
movie_info = p_line.next_sibling

As you go through this loop print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by movie line's HTML--you've got it!

I give you the beginning of the loop below, and then you can build it piece by piece. If you want to see the overall architecture of the final loop, I have a commented example at the end of the page--it might not be helpful to look at at this point. See how you do step-by-step and if you get stuck at a step email me with your code!



In [6]:
##Write your loop for STEP 3 here
#I started this for you,
#Because you only want it to search starting with each critic
#   if line.strong is not None: does that for you
for p_line in all_p:

    if p_line.strong is not None:
        critic_line = p_line.strong.string
        movie_line = p_line.next_sibling
  
        print(critic_line.string, movie_line)


Simon Abrams – Freelance film critic (US) <p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Rasoulof, 2009)<br/>7. Night Across the Street (Raoul Ruiz, 2012)<br/>8. Certified Copy (Abbas Kiarostami, 2010)<br/>9. Sparrow (Johnnie To, 2008)<br/>10. Fados (Carlos Saura, 2007)</p>
Sam Adams – Freelance film critic (US) <p>1. In the Mood for Love (Wong Kar-wai, 2000)<br/>2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)<br/>3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)<br/>4. Spirited Away (Hayao Miyazaki, 2001)<br/>5. The Act of Killing (Joshua Oppenheimer, 2012)<br/>6. The Grand Budapest Hotel (Wes Anderson, 2014)<br/>7. The New World (Terrence Malick, 2004)<br/>8. Certified Copy (Abbas Kiarostami, 2010)<br/>9. The World (Jia Zhangke, 2004

**STEP 4**
If your loop is successfully isolating those two lines: now it's time to parse each line with regular expressions. This needs to happen inside the loop--for every critic, and then (in STEP 5) for every movie. Here just **focus on getting the critics name, organization, and country.**

Inside the loop--once you have critic_info -- make a regular expression that pulls out the name of the critic--make a variable called critic_name

`critic_name = findall(regex,critic_info)`

Do the same thing for critic_org and critic_cn

As you go print(critic_name) then print(critic_org), etc.--to make sure you're getting the results. It might help, before you do all these regular expressions in a loop, to just grab one critics line and test regular expressions on it--to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [7]:
##Practice/Build your regular expressions here
#crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r"(.*)\–"
regex_for_org = r"– (.*) [(]"
regex_for_cn = r"(?<=\().+?(?=\))"

#name = re.findall(regex_for_name,crit_sample)
#name[0]

#org = re.findall(regex_for_org,crit_sample)
#org[0]

#cn = re.findall(regex_for_cn,crit_sample)
#cn[0]


In [8]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it

for p_line in all_p:

    if p_line.strong is not None:
        
        critic_line = p_line.strong.string
        critic_name = re.findall(regex_for_name, critic_line)
        critic_org = re.findall(regex_for_org, critic_line)
        critic_cn = re.findall(regex_for_cn, critic_line)
        
        movie_line = p_line.next_sibling
  
        #print(critic_line.string, movie_line)
        #print(critic_name)
        #print(critic_cn)
    
        

**STEP 5**
Now you need to get your **movie names**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. I showed you this in class, but I'll just tell you again how to do this. To get a list of everything that is not a `<BR>` tag, use this method:

`each_movie = movie_info.find_all(string=True)`

This will give you a list called `each_movie`. Which will contain a string for each movie. Like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop, that goes to each movie and prints out each movie.


Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.


In [9]:
#Practice/Build your regular expressions here
#movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
#movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_rank = r"^\d{1,2}."
regex_for_mname = r"^\d{1,2}.([^\(]+)"
regex_for_year = r"(\d*)\)$"
regex_for_director = r"\((.*),"
#movie_name = re.findall(regex_for_rank,movie_harder)
#movie_name[0]





**STEP 6**
You're almost there!!! Now that you have a working regular expression put that in your inner loop to get the move name.

So now the entire loop should be getting you 13 elements:
-critic_name
-critic_org
-critic_cn

And an inner loop that will run 10 times (for the 10 movies) and give you 10 instances of:
-rank (this is actually optional)
-movie_name
-director
-year

Build this loop using print() on the first one or two critic selections. Just to make sure you are pulling out the right data.




In [10]:
##TakeYou're working loop And add the find_all for each_movie
#And the inner loop that loops through each_movie

for p_line in all_p:

    if p_line.strong is not None:
        
        critic_line = p_line.strong.string
        movie_line = p_line.next_sibling
  
        dir_list = movie_line.find_all(string=True)
        print("--")
        for each_movie in dir_list:
            
            single_movie = each_movie
            rank = re.findall(regex_for_rank, single_movie)
            mname = re.findall(regex_for_mname, single_movie)
            director = re.findall(regex_for_director, single_movie)
            year = re.findall(regex_for_year, single_movie)
            
            print(rank, mname, director, year)      
            

--
['1.'] [' Mulholland Drive '] ['David Lynch'] ['2001']
['2.'] [' In the Mood for Love '] ['Wong Kar-wai'] ['2000']
['3.'] [' The Tree of Life '] ['Terrence Malick'] ['2011']
['4.'] [' Yi Yi: A One and a Two '] ['Edward Yang'] ['2000']
['5.'] [' Goodbye to Language '] ['Jean-Luc Godard'] ['2014']
['6.'] [' The White Meadows '] ['Mohammad Rasoulof'] ['2009']
['7.'] [' Night Across the Street '] ['Raoul Ruiz'] ['2012']
['8.'] [' Certified Copy '] ['Abbas Kiarostami'] ['2010']
['9.'] [' Sparrow '] ['Johnnie To'] ['2008']
['10.'] [' Fados '] ['Carlos Saura'] ['2007']
--
['1.'] [' In the Mood for Love '] ['Wong Kar-wai'] ['2000']
['2.'] [' Eternal Sunshine of the Spotless Mind '] ['Michel Gondry'] ['2004']
['3.'] [' Syndromes and a Century '] ['Apichatpong Weerasethakul'] ['2006']
['4.'] [' Spirited Away '] ['Hayao Miyazaki'] ['2001']
['5.'] [' The Act of Killing '] ['Joshua Oppenheimer'] ['2012']
['6.'] [' The Grand Budapest Hotel '] ['Wes Anderson'] ['2014']
['7.'] [' The New World '] [

['7.'] [' Synecdoche, New York '] ['Charlie Kaufman'] ['2008']
['8.'] [' Love Exposure '] ['Sion Sono'] ['2008']
['9.'] [' The Tree of Life '] ['Terrence Malick'] ['2011']
['10.'] [' Casino Royale '] ['Martin Campbell'] ['2006']
--
['1.'] [' Oldboy '] ['Park Chan-wook'] ['2003']
['2.'] [' Children of Men '] ['Alfonso Cuarón'] ['2006']
['3.'] [' Mulholland Drive '] ['David Lynch'] ['2001']
['4.'] [' The Secret in Their Eyes '] ['Juan José Campanella'] ['2009']
['5.'] [' Sicario '] ['Denis Villeneuve'] ['2015']
['6.'] [' No '] ['Pablo Larraín'] ['2012']
['7.'] [' Donnie Darko '] ['Richard Kelly'] ['2001']
['8.'] [' Eternal Sunshine of the Spotless Mind '] ['Michel Gondry'] ['2004']
['9.'] [' The Social Network '] ['David Fincher'] ['2010']
['10.'] [' Whiplash '] ['Damien Chazelle'] ['2014']
--
['1.'] [' Ex Machina '] ['Alex Garland'] ['2015']
['2.'] [' Inside Out '] ['Pete Docter'] ['2015']
['3.'] [' Animal Kingdom '] ['David Michôd'] ['2010']
['4.'] [' Apocalypto '] ['Mel Gibson'] ['200

['2.'] [' Yi Yi: A One and a Two '] ['Edward Yang'] ['2000']
['3.'] [' Before Sunset '] ['Richard Linklater'] ['2004']
['4.'] [' You Can Count On Me '] ['Kenneth Lonergan'] ['2000']
['5.'] [' Inside Out '] ['Pete Docter'] ['2015']
['6.'] [' Morvern Callar '] ['Lynne Ramsay'] ['2012']
['7.'] [' Stories We Tell '] ['Sarah Polley'] ['2012']
['8.'] [' Animal Kingdom '] ['David Michôd'] ['2010']
['9.'] [' Attack the Block '] ['Joe Cornish'] ['2011']
['10.'] [' Jackass 3D '] ['Jeff Tremaine'] ['2010']
--
['1.'] [' Mulholland Drive '] ['David Lynch'] ['2001']
['2.'] [' AI: Artificial Intelligence '] ['Steven Spielberg'] ['2001']
['3.'] [' The Tree of Life '] ['Terrence Malick'] ['2011']
['4.'] [' Syndromes and a Century '] ['Apichatpong Weerasethakul'] ['2006']
['5.'] [' What Time Is It There? '] ['Tsai Ming-liang'] ['2001']
['6.'] [' The Intruder '] ['Claire Denis'] ['2004']
['7.'] [' The Son '] ['Jean-Pierre and Luc Dardenne'] ['2002']
['8.'] [' Before Sunset '] ['Richard Linklater'] ['2004

['1.'] [' Mulholland Drive '] ['David Lynch'] ['2001']
['2.'] [' Synecdoche, New York '] ['Charlie Kaufman'] ['2008']
['3.'] [' Birth '] ['Jonathan Glazer'] ['2004']
['4.'] [' Elena '] ['Andrey Zvyagintsev'] ['2011']
['5.'] [' Carol '] ['Todd Haynes'] ['2015']
['6.'] [' Tabu '] ['Miguel Gomes'] ['2012']
['7.'] [' Master and Commander: The Far Side of the World '] ['Peter Weir'] ['2003']
['8.'] [' Margaret '] ['Kenneth Lonergan'] ['2011']
['9.'] [' There Will Be Blood '] ['Paul Thomas Anderson'] ['2007']
['10.'] [' 12 Years a Slave '] ['Steve McQueen'] ['2013']
--
['1.'] [' 25th Hour '] ['Spike Lee'] ['2002']
['2.'] [' City of God '] ['Fernando Meirelles and Kátia Lund'] ['2002']
['3.'] [' The Act of Killing '] ['Joshua Oppenheimer'] ['2012']
['4.'] [' The Prestige '] ['Christopher Nolan'] ['2006']
['5.'] [' Spirited Away '] ['Hayao Miyazaki'] ['2001']
['6.'] [' The Incredibles '] ['Brad Bird'] ['2004']
['7.'] [' Gosford Park '] ['Robert Altman'] ['2001']
['8.'] [' Memento '] ['Christop

**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this let me know and we can discuss what to do next. If you've made it just following instructions, you are in great shape for the rest of this project--if not, don't worry! I will get you through by midweek.

The final step is building a list of lists of all this information.

So you need have a loop that gets everything out--but you also need to figure out **how  you want to organize what you're pulling out.** What should a row look like in your table?


In the cell below, I give you a final architecture you need to use to get this most challenging list of lists.

In [11]:
#figure out how you're going to collect your clean information
list_of_movies = []
#for loop that goes throug all the <p> elements
for p_line in all_p:
    
    #if strong (begins with the critic)
    if p_line.strong is not None:

        #critic_info= get the critic line
        critic_line = p_line.strong.string
        #critic_name = re.search(regex,critic_info)
        critic_name = re.search(regex_for_name, critic_line)
        #critic_org = re.search(regex,critic_info)
        critic_org = re.search(regex_for_org, critic_line)
        #critic_cn = re.search(regex,critic_info)
        critic_cn = re.search(regex_for_cn, critic_line)
        
        #movie_info = get movie line using next_sibling
        movie_line = p_line.next_sibling
        
        #get each movie string
        dir_list = movie_line.find_all(string=True)
        #loop through each movie_line (#1 through #10)
        for single_movie in dir_list:
            
            #movie_rank = re.search(regex,movie_line)
            rank = re.search(regex_for_rank, single_movie)
            #movie_name = re.search(regex,movie_line)
            mname = re.search(regex_for_mname, single_movie)
            #movie_dir = re.search(regex,movie_line)
            director = re.search(regex_for_director, single_movie)
            #movie_year = re.search(regex,movie_line)
            year = re.search(regex_for_year, single_movie)
            #this will happen 10 times
            #print(rank)
            
            try:
                movie_dictionary = {
                    'movie_name': mname.group(1),
                    'director': director.group(1),
                    'year': year.group(1),
                    'rank': rank.group(0),
                    'critic': critic_name.group(1),
                    'crit_organisation': critic_org.group(1),
                    'crit_country': critic_cn.group(0)
                }
            except:
                print("No match found in", single_movie)
            
            list_of_movies.append(movie_dictionary)

#You will want to build a list tickets appended to list_of_what
#Try to figure out how you want to append things
#That is, how you want to organize your data


No match found in If you would like to comment on this story or anything else you have seen on BBC Culture, head over to our 
No match found in Facebook
No match found in  page or message us on 
No match found in Twitter
No match found in .


In [118]:
##Take a peek at your final lists of lists
list_of_movies

[{'movie_name': ' Mulholland Drive ',
  'director': 'David Lynch',
  'year': '2001',
  'rank': '1.',
  'critic': 'Simon Abrams ',
  'crit_organisation': 'Freelance film critic',
  'crit_country': 'US'},
 {'movie_name': ' In the Mood for Love ',
  'director': 'Wong Kar-wai',
  'year': '2000',
  'rank': '2.',
  'critic': 'Simon Abrams ',
  'crit_organisation': 'Freelance film critic',
  'crit_country': 'US'},
 {'movie_name': ' The Tree of Life ',
  'director': 'Terrence Malick',
  'year': '2011',
  'rank': '3.',
  'critic': 'Simon Abrams ',
  'crit_organisation': 'Freelance film critic',
  'crit_country': 'US'},
 {'movie_name': ' Yi Yi: A One and a Two ',
  'director': 'Edward Yang',
  'year': '2000',
  'rank': '4.',
  'critic': 'Simon Abrams ',
  'crit_organisation': 'Freelance film critic',
  'crit_country': 'US'},
 {'movie_name': ' Goodbye to Language ',
  'director': 'Jean-Luc Godard',
  'year': '2014',
  'rank': '5.',
  'critic': 'Simon Abrams ',
  'crit_organisation': 'Freelance fi

If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own--they will use similar logic, but they will not be nearly as complicated as this one.

In [12]:
df = pd.DataFrame(list_of_movies)
df.head(20)

Unnamed: 0,crit_country,crit_organisation,critic,director,movie_name,rank,year
0,US,Freelance film critic,Simon Abrams,David Lynch,Mulholland Drive,1.0,2001
1,US,Freelance film critic,Simon Abrams,Wong Kar-wai,In the Mood for Love,2.0,2000
2,US,Freelance film critic,Simon Abrams,Terrence Malick,The Tree of Life,3.0,2011
3,US,Freelance film critic,Simon Abrams,Edward Yang,Yi Yi: A One and a Two,4.0,2000
4,US,Freelance film critic,Simon Abrams,Jean-Luc Godard,Goodbye to Language,5.0,2014
5,US,Freelance film critic,Simon Abrams,Mohammad Rasoulof,The White Meadows,6.0,2009
6,US,Freelance film critic,Simon Abrams,Raoul Ruiz,Night Across the Street,7.0,2012
7,US,Freelance film critic,Simon Abrams,Abbas Kiarostami,Certified Copy,8.0,2010
8,US,Freelance film critic,Simon Abrams,Johnnie To,Sparrow,9.0,2008
9,US,Freelance film critic,Simon Abrams,Carlos Saura,Fados,10.0,2007


In [14]:
df.dtypes

crit_country         object
crit_organisation    object
critic               object
director             object
movie_name           object
rank                 object
year                 object
dtype: object

In [15]:
df.rank=df['rank'].str.replace(".","").head()
df.rank.astype(int).head()

0    1
1    2
2    3
3    4
4    5
Name: rank, dtype: int64

In [16]:
#Director with most total votes
df.director.unique()

array(['David Lynch', 'Wong Kar-wai', 'Terrence Malick', 'Edward Yang',
       'Jean-Luc Godard', 'Mohammad Rasoulof', 'Raoul Ruiz',
       'Abbas Kiarostami', 'Johnnie To', 'Carlos Saura', 'Michel Gondry',
       'Apichatpong Weerasethakul', 'Hayao Miyazaki',
       'Joshua Oppenheimer', 'Wes Anderson', 'Jia Zhangke',
       'Gus Van Sant', 'Kathryn Bigelow', 'David Cronenberg',
       'Sarah Polley', 'Martin Campbell', 'Miguel Gomes', 'Pablo Berger',
       'Courtney Hunt', 'Robert Altman', 'Christopher Nolan',
       'Guillermo Del Toro', 'Michael Haneke', 'Werner Herzog',
       'Cristian Mungiu', 'Leos Carax', 'Claude Lanzmann',
       'Paul Thomas Anderson', 'Kenneth Lonergan', 'Mary Harron',
       'Jessica Hausner', 'Andrea Arnold', 'Richard Linklater',
       'Pablo Larraín', 'Joel and Ethan Coen', 'Asghar Farhadi',
       'Andrew Stanton and Lee Unkrich', 'Zhang Yimou', 'Martin Scorsese',
       'Bong Joon-ho', 'Paul Greengrass', 'Steven Soderbergh',
       'Danièle Huillet a

In [17]:
#-Director with most total movies on the list
df.groupby('director').movie_name.nunique().sort_values(ascending=False).head()

director
Quentin Tarantino    6
Clint Eastwood       5
Nuri Bilge Ceylan    5
Tsai Ming-liang      5
Lars von Trier       4
Name: movie_name, dtype: int64

In [18]:
#a the movie with most total nominations on the list
df.movie_name.value_counts().head()

 In the Mood for Love     49
 Mulholland Drive         47
 There Will Be Blood      35
 Spirited Away            34
 Boyhood                  30
Name: movie_name, dtype: int64

In [19]:
#b the movie with most total nominations on the list
df.groupby('director').movie_name.value_counts().sort_values(ascending=False).head()

director              movie_name            
Wong Kar-wai           In the Mood for Love     49
David Lynch            Mulholland Drive         47
Paul Thomas Anderson   There Will Be Blood      35
Hayao Miyazaki         Spirited Away            34
Richard Linklater      Boyhood                  30
Name: movie_name, dtype: int64

In [20]:
df.groupby('director').movie_name.value_counts().head()

director              movie_name                 
Abbas Kiarostami       Certified Copy                9
                       Ten                           4
                       Like Someone In Love          2
Abdellatif Kechiche    Blue Is the Warmest Color     7
Abderrahmane Sissako   Timbuktu                      9
Name: movie_name, dtype: int64

In [21]:
#From which country are the critics?
df.crit_country.value_counts()/10

US              82.0
UK              18.0
France           5.0
Germany          5.0
Canada           5.0
India            5.0
Cuba             5.0
Australia        4.0
Colombia         4.0
Israel           4.0
Italy            4.0
UAE              3.0
Lebanon          3.0
Turkey           2.0
South Korea      2.0
Chile            2.0
Argentina        2.0
Mexico           2.0
Singapore        2.0
Austria          2.0
China            1.5
Belgium          1.0
Indonesia        1.0
Taiwan           1.0
Kazakhstan       1.0
Philippines      1.0
Egypt            1.0
Japan            1.0
Switzerland      1.0
Hong Kong        1.0
Namibia          1.0
South Africa     1.0
Brazil           1.0
Bangladesh       1.0
Senegal          1.0
Qatar            1.0
Name: crit_country, dtype: float64

# To get the country of origin for each director in The Movie Database (TMDB) 

In [25]:
import requests

In [26]:
response = requests.get('https://api.themoviedb.org/3/movie/550?api_key=47842dc04785315936dfef301fd6bbc9')

In [27]:
# API-key: 47842dc04785315936dfef301fd6bbc9

In [28]:
data = response.json()

In [29]:
print(data)

{'adult': False, 'backdrop_path': '/87hTDiay2N2qWyX4Ds7ybXi9h8I.jpg', 'belongs_to_collection': None, 'budget': 63000000, 'genres': [{'id': 18, 'name': 'Drama'}], 'homepage': 'http://www.foxmovies.com/movies/fight-club', 'id': 550, 'imdb_id': 'tt0137523', 'original_language': 'en', 'original_title': 'Fight Club', 'overview': 'A ticking-time-bomb insomniac and a slippery soap salesman channel primal male aggression into a shocking new form of therapy. Their concept catches on, with underground "fight clubs" forming in every town, until an eccentric gets in the way and ignites an out-of-control spiral toward oblivion.', 'popularity': 25.87, 'poster_path': '/adw6Lq9FiC9zjYEpOqfq03ituwp.jpg', 'production_companies': [{'id': 508, 'logo_path': '/7PzJdsLGlR7oW4J0J5Xcd0pHGRg.png', 'name': 'Regency Enterprises', 'origin_country': 'US'}, {'id': 711, 'logo_path': '/tEiIH5QesdheJmDAqQwvtN60727.png', 'name': 'Fox 2000 Pictures', 'origin_country': 'US'}, {'id': 20555, 'logo_path': None, 'name': 'Taur

In [121]:
directors= df.director.unique()

In [122]:
persons = pd.read_json('person_ids_06_30_2018.json', lines=True)

In [123]:
persons.head()

Unnamed: 0,adult,id,name,popularity
0,False,54768,Turo Pajala,0.12
1,False,53836,Risto Karhula,0.0
2,False,4828,Sakari Kuosmanen,2.9e-05
3,False,27436,Costa-Gavras,8e-06
4,False,71416,Franco Solinas,0.00036


In [124]:
from time import sleep

In [32]:
#Neue leere Liste für Dir_name und place of birth
dir_birthplaces= []

In [126]:
#Counter für Info, falls request stoppt wegen überforderunung, dann weiss man, anhand (?print, 'Index:', counter) wann es passiert ist:
counter = 0

for director in directors:
    print(director, "Index:", counter)
    counter = counter+1
    try:
        
        #Bedingung: dir_name = director name
        id_row = persons[persons['name']==director].head(1)
        id = id_row['id'].values[0]
        #5 Sekunden warten zwischen Anfragen
        sleep(5)
        response = requests.get('https://api.themoviedb.org/3/person/{}?api_key=47842dc04785315936dfef301fd6bbc9&language=en-US'.format(id))
        data=response.json()
        
        #mache daraus ein dictionary:
        dictionary ={
            'name' : director,
            'place of birth' : data['place_of_birth']
        }
        
        #hänge Dict. an neue Liste an(füllen)
        dir_birthplaces.append(dictionary)
        print(" ", data['place_of_birth'])
    except:
        print("  no id found for", director)

David Lynch Index: 0
  None
Wong Kar-wai Index: 1
  Shanghai, China
Terrence Malick Index: 2
  Waco, Texas, USA
Edward Yang Index: 3
   Shanghai, China
Jean-Luc Godard Index: 4
  Paris, France
Mohammad Rasoulof Index: 5
  Shiraz, Iran
Raoul Ruiz Index: 6
  no id found for Raoul Ruiz
Abbas Kiarostami Index: 7
  Teheran, Iran
Johnnie To Index: 8
  Hong Kong
Carlos Saura Index: 9
  Huesca, Spain
Michel Gondry Index: 10
  Versailles, France
Apichatpong Weerasethakul Index: 11
  Bangkok, Thailand
Hayao Miyazaki Index: 12
  Tokyo
Joshua Oppenheimer Index: 13
  Texas, USA
Wes Anderson Index: 14
  Houston, Texas, USA
Jia Zhangke Index: 15
  Fenyang, Shanxi, China
Gus Van Sant Index: 16
  Louisville, Kentucky, USA
Kathryn Bigelow Index: 17
  San Carlos, California, USA
David Cronenberg Index: 18
  Toronto, Ontario, Canada
Sarah Polley Index: 19
  Toronto - Ontario - Canada
Martin Campbell Index: 20
  None
Miguel Gomes Index: 21
  Lisbon, Portugal
Pablo Berger Index: 22
  Bilbao, Vizcaya, País V

  Tokyo, Japan
Alain Gomis Index: 154
  None
Ken Loach Index: 155
  Nuneaton, Warwickshire, England, UK
Newton I. Aduaka Index: 156
  None
Cameron Crowe Index: 157
  Palm Springs, California, USA
Aleksey German Index: 158
  Leningrad, Soviet Union
Larry Charles Index: 159
  Brooklyn, New York, USA
Don Hertzfeldt Index: 160
  Fremont, California, USA
Alex Garland Index: 161
  London, England
Joss Whedon Index: 162
  New York City, New York, USA
Neill Blomkamp Index: 163
  Johannesburg, South Africa
Whit Stillman Index: 164
  Washington, District of Columbia, USA
Lav Diaz Index: 165
  Datu Paglas, Cotabato, Philippines
Kiyoshi Kurosawa Index: 166
  Kobe, Japan
Takahisa Zeze Index: 167
  None
David O. Russell Index: 168
  New York City, New York, USA
Susanne Bier Index: 169
  Copenhagen, Denmark
Jørgen Leth and Lars von Trier Index: 170
  no id found for Jørgen Leth and Lars von Trier
Paweł Pawlikowski Index: 171
  Warsaw, Mazowieckie, Poland
Hirokazu Koreeda Index: 172
  Tokyo, Japan
Dam

  Wellington, New Zealand
Peter Jackson Index: 317
  Pukerua Bay, North Island, New Zealand
Nouri Bouzid Index: 318
  None
Rehad Desai Index: 319
  None
Mohamed Diab Index: 320
  None
Oliver Schmitz Index: 321
  Cape Town, South Africa
Tom McCarthy Index: 322
  None
Dror Moreh Index: 323
  None
Thom Andersen Index: 324
  None
Mark Neveldine and Brian Taylor Index: 325
  no id found for Mark Neveldine and Brian Taylor
Patricio Guzmán Index: 326
  None
Víctor Erice Index: 327
  Carranza, Vizcaya, País Vasco, España
Stephen Chow Index: 328
  香港
Peter Tscherkassky Index: 329
  None
Philippe Grandrieux Index: 330
  Saint-Étienne, France
Joseph Kahn Index: 331
  Jersey Village, Texas
Rita Azevedo Gomes Index: 332
  None
Sidney Lumet Index: 333
  Philadelphia, Pennsylvania, USA
Woody Allen Index: 334
  The Bronx, New York City, New York, USA
Alain Resnais Index: 335
  Vannes, Morbihan, Bretagne, France
Manoel de Oliveira Index: 336
  Porto, Portugal
Michelangelo Antonioni Index: 337
  Ferrara

In [58]:
import datetime
timestamp = datetime.datetime.now().strftime('%Y%m%d')

df_dir_birthplaces.to_csv("dir_birthplaces_" + timestamp + '.csv', index=False)


NameError: name 'df_dir_birthplaces' is not defined

In [30]:
df_birthplace['country'] = df_birthplace.birthplace.str.extract("(\w*)\"$")
df_birthplace.head()

NameError: name 'df_birthplace' is not defined

In [37]:
df_end= pd.read_csv("dir_birthplaces.csv")
df_end.head()

Unnamed: 0,name,place of birth
0,David Lynch,
1,Wong Kar-wai,"Shanghai, China"
2,Terrence Malick,"Waco, Texas, USA"
3,Edward Yang,"Shanghai, China"
4,Jean-Luc Godard,"Paris, France"


In [38]:
df_end.shape


(365, 2)

In [39]:
df_end['country'] = df_end['place of birth'].str.extract("(\w*)$")

In [40]:
df_end

Unnamed: 0,name,place of birth,country
0,David Lynch,,
1,Wong Kar-wai,"Shanghai, China",China
2,Terrence Malick,"Waco, Texas, USA",USA
3,Edward Yang,"Shanghai, China",China
4,Jean-Luc Godard,"Paris, France",France
5,Mohammad Rasoulof,"Shiraz, Iran",Iran
6,Abbas Kiarostami,"Teheran, Iran",Iran
7,Johnnie To,Hong Kong,Kong
8,Carlos Saura,"Huesca, Spain",Spain
9,Michel Gondry,"Versailles, France",France


In [41]:
df_end['country'].value_counts(ascending=False)

USA              73
France           27
UK               20
Germany           9
Canada            9
                  7
Japan             7
Australia         6
Korea             6
England           6
China             5
Italy             5
Denmark           4
India             4
Argentina         4
Sweden            4
Zealand           3
Mexico            3
Belgium           3
Iran              3
Romania           3
Spain             3
Poland            3
Portugal          2
States            2
Russia            2
Ireland           2
Francia           2
Israel            2
Italia            2
                 ..
Kong              1
Kenya             1
Isfahan           1
US                1
SC                1
Hungary           1
Kingdom           1
Texas             1
Florida           1
Cameroon          1
Chile             1
Provinc           1
Turkey            1
Lisbona           1
Kuching           1
Brazil            1
Philippines       1
Guangdong         1
Uniti             1


In [42]:
df_end.sort_values(by="country", ascending=False)

Unnamed: 0,name,place of birth,country
289,Stephen Chow,香港,香港
68,Ang Lee,臺灣屏東縣潮州鎮,臺灣屏東縣潮州鎮
344,Jane Campion,"Wellington, New Zealand",Zealand
279,Peter Jackson,"Pukerua Bay, North Island, New Zealand",Zealand
278,Andrew Dominik,"Wellington, New Zealand",Zealand
50,Charlie Kaufman,"New York City, New York",York
52,Rick Alverson,"Richmond, Virginia",Virginia
175,Derek Cianfrance,"Lakewood, Colorado, Stati Uniti",Uniti
135,Aleksey German,"Leningrad, Soviet Union",Union
73,Ava DuVernay,"Long Beach, California, USA",USA


# Cleaning 

In [45]:
df_end['country'].replace("España", "Spain", inplace=True)

In [46]:
df_end['country'].replace("Francia", "France", inplace = True)
df_end['country'].replace("Kong", "China", inplace = True)
df_end['country'].replace("England", "UK", inplace = True)
df_end['country'].replace("Zealand", "New Zealand", inplace = True)
df_end['country'].replace("Italia", "Italy", inplace = True)
df_end['country'].replace("Texas", "USA", inplace = True)
df_end['country'].replace("California", "USA", inplace = True)
df_end['country'].replace("Lisbona", "Portugal", inplace = True)
df_end['country'].replace("Massachusetts", "USA", inplace = True)
df_end['country'].replace("US", "USA", inplace = True)
df_end['country'].replace("Virginia", "USA", inplace = True)
df_end['country'].replace("Kingdom", "UK", inplace = True)
df_end['country'].replace("Tokyo", "Japan", inplace = True)
df_end['country'].replace("States", "USA", inplace = True)
df_end['country'].replace("Florida", "USA", inplace = True)
df_end['country'].replace("Kuching", "Malaysia", inplace = True)
df_end['country'].replace("Uniti", "USA", inplace = True)
df_end['country'].replace("Isfahan", "Iran", inplace = True)
df_end['country'].replace("York", "USA", inplace = True)
df_end['country'].replace("Guangdong", "China", inplace = True)
df_end['country'].replace("Rico", "Puerto Rico", inplace = True)
df_end['country'].replace("Scotland", "UK", inplace = True)
df_end['country'].replace("SC", "USA", inplace = True)
df_end['country'].replace("Union", "Russia", inplace = True)
df_end['country'].replace("Provinc", "China", inplace = True)
df_end['country'].replace("臺灣屏東縣潮州鎮", "China", inplace = True)
df_end['country'].replace("IN", "USA", inplace = True)
df_end['country'].replace("香港", "China", inplace = True)
df_end['country'].replace("U.S", "USA", inplace = True)
df_end['country'].replace("Africa", "South Africa", inplace = True)
df_end['country'].replace("Korea", "South Korea", inplace = True)
df_end['country'].replace("Hong Kong", "China", inplace = True)

In [47]:
df_end['country'].value_counts(ascending=False)

USA             85
France          29
UK              27
China           10
Germany          9
Canada           9
Japan            8
                 7
Italy            7
Australia        6
South Korea      6
Spain            5
India            4
Sweden           4
Argentina        4
Iran             4
Denmark          4
Portugal         3
Belgium          3
Russia           3
Poland           3
New Zealand      3
Romania          3
Mexico           3
Austria          2
Ireland          2
Israel           2
South Africa     2
Chile            1
Philippines      1
Thailand         1
Lebanon          1
Hungary          1
Lithuania        1
Kenya            1
Brazil           1
Colombia         1
Finland          1
Turkey           1
Malaysia         1
Greece           1
Cameroon         1
Tunisia          1
Ecuador          1
Haiti            1
Puerto Rico      1
Name: country, dtype: int64

In [48]:
df_end.groupby('country').name.value_counts().sort_values()


country  name                            
         Christopher Guest                   1
UK       James Marsh                         1
         Joe Cornish                         1
         Jonathan Glazer                     1
         Ken Loach                           1
         Lynne Ramsay                        1
         Michael Winterbottom                1
         Mike Leigh                          1
         Paul Greengrass                     1
         Peter Mullan                        1
         Ridley Scott                        1
         Sam Mendes                          1
         Shane Meadows                       1
         Stephen Daldry                      1
         Stephen Frears                      1
         Edgar Wright                        1
         Terry George                        1
USA      Adam McKay                          1
         Alexander Payne                     1
         Andrew Stanton                      1
         Ava DuVer

In [49]:
df_end.groupby('country')['name'].nunique().sort_values(ascending=False)

country
USA             85
France          29
UK              27
China           10
Canada           9
Germany          9
Japan            8
Italy            7
                 7
Australia        6
South Korea      6
Spain            5
Iran             4
Argentina        4
Denmark          4
Sweden           4
India            4
Russia           3
Portugal         3
Poland           3
New Zealand      3
Mexico           3
Belgium          3
Romania          3
Austria          2
Israel           2
Ireland          2
South Africa     2
Puerto Rico      1
Philippines      1
Malaysia         1
Lithuania        1
Brazil           1
Cameroon         1
Turkey           1
Chile            1
Tunisia          1
Colombia         1
Lebanon          1
Ecuador          1
Finland          1
Thailand         1
Kenya            1
Greece           1
Haiti            1
Hungary          1
Name: name, dtype: int64

In [50]:
df_groupby_country = df_end.groupby('country')

In [51]:
df_end

Unnamed: 0,name,place of birth,country
0,David Lynch,,
1,Wong Kar-wai,"Shanghai, China",China
2,Terrence Malick,"Waco, Texas, USA",USA
3,Edward Yang,"Shanghai, China",China
4,Jean-Luc Godard,"Paris, France",France
5,Mohammad Rasoulof,"Shiraz, Iran",Iran
6,Abbas Kiarostami,"Teheran, Iran",Iran
7,Johnnie To,Hong Kong,China
8,Carlos Saura,"Huesca, Spain",Spain
9,Michel Gondry,"Versailles, France",France


In [52]:
properties_list = []

for country_name, group in df_groupby_country:
 
# ´group` hier ist wiederum ein eigener, kleiner Data Frame mit nur den 
# Zeilen, die der Gruppe entsprechen. Du kannst mit diesem DF so verfahren 
# wie mit anderen DataFrames auch, zum Beispiel ...
 
    # Herausfinden, wie viele Zeilen der Data Frame hat
    amount_of_directors = len(group) 
 
    # Oder alle Texte innerhalb einer bestimmten Spalte mit Kommas 
    # aneinanderreihen,
    directors_name = '<br>'.join(group['name']) # gibt "Element 1, Element 2" ...
 
    title = amount_of_directors
 
# Zu einem Dictionary zusammensetzen ...
    dictionary = {
    'properties.headline': str(amount_of_directors) + " directors",
    'properties.amount_of_dir': str(amount_of_directors),
    'properties.article': directors_name, 
    'properties.name': country_name,
    'properties.color': '#333333'
    }
 
# Zur Liste anfügen
    properties_list.append(dictionary)
    


In [53]:
properties_list

[{'properties.headline': '7 directors',
  'properties.amount_of_dir': '7',
  'properties.article': 'Mariano Llinás<br>Miranda July<br>Travis Wilkerson<br>Frederick Wiseman<br>Paolo Sorrentino<br>Robert Zemeckis<br>Christopher Guest',
  'properties.name': '',
  'properties.color': '#333333'},
 {'properties.headline': '4 directors',
  'properties.amount_of_dir': '4',
  'properties.article': 'Lucrecia Martel<br>Juan José Campanella<br>Gaspar Noé<br>Fabián Bielinsky',
  'properties.name': 'Argentina',
  'properties.color': '#333333'},
 {'properties.headline': '6 directors',
  'properties.amount_of_dir': '6',
  'properties.article': 'Baz Luhrmann<br>Jennifer Kent<br>George Miller<br>David Michôd<br>Peter Weir<br>Warwick Thornton',
  'properties.name': 'Australia',
  'properties.color': '#333333'},
 {'properties.headline': '2 directors',
  'properties.amount_of_dir': '2',
  'properties.article': 'Jessica Hausner<br>Ulrich Seidl',
  'properties.name': 'Austria',
  'properties.color': '#333333

In [54]:
df_properties=pd.DataFrame(properties_list)
df_properties

Unnamed: 0,properties.amount_of_dir,properties.article,properties.color,properties.headline,properties.name
0,7,Mariano Llinás<br>Miranda July<br>Travis Wilke...,#333333,7 directors,
1,4,Lucrecia Martel<br>Juan José Campanella<br>Gas...,#333333,4 directors,Argentina
2,6,Baz Luhrmann<br>Jennifer Kent<br>George Miller...,#333333,6 directors,Australia
3,2,Jessica Hausner<br>Ulrich Seidl,#333333,2 directors,Austria
4,3,Chantal Akerman<br>Bouli Lanners<br>Agnès Varda,#333333,3 directors,Belgium
5,1,Kleber Mendonça Filho,#333333,1 directors,Brazil
6,1,Jean-Marie Téno,#333333,1 directors,Cameroon
7,9,David Cronenberg<br>Sarah Polley<br>Mary Harro...,#333333,9 directors,Canada
8,1,Pablo Larraín,#333333,1 directors,Chile
9,10,Wong Kar-wai<br>Edward Yang<br>Johnnie To<br>J...,#333333,10 directors,China


In [55]:
df_properties.to_csv("BBC_properties.csv", index=False)


In [56]:
out_properties = pd.read_csv("BBC_properties.csv")
out_properties

Unnamed: 0,properties.amount_of_dir,properties.article,properties.color,properties.headline,properties.name
0,7,Mariano Llinás<br>Miranda July<br>Travis Wilke...,#333333,7 directors,
1,4,Lucrecia Martel<br>Juan José Campanella<br>Gas...,#333333,4 directors,Argentina
2,6,Baz Luhrmann<br>Jennifer Kent<br>George Miller...,#333333,6 directors,Australia
3,2,Jessica Hausner<br>Ulrich Seidl,#333333,2 directors,Austria
4,3,Chantal Akerman<br>Bouli Lanners<br>Agnès Varda,#333333,3 directors,Belgium
5,1,Kleber Mendonça Filho,#333333,1 directors,Brazil
6,1,Jean-Marie Téno,#333333,1 directors,Cameroon
7,9,David Cronenberg<br>Sarah Polley<br>Mary Harro...,#333333,9 directors,Canada
8,1,Pablo Larraín,#333333,1 directors,Chile
9,10,Wong Kar-wai<br>Edward Yang<br>Johnnie To<br>J...,#333333,10 directors,China
