## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The most difficult step is the first one: scraping the page on the BBC site and, using Beautiful Soup and regular expressions, building a data set you can actually work with.

Once you have the data set, you will be in good shape going forward--the goal after that will be to search for interesting patterns (top movies by country/critic/director/year)--this is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge you'll be dealing with through Wednesday of this week. The HTML page on the BBC site poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes things a little harder, because there are not many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for a critic, and you can isolate each group of top 10 movies as well. The harder part is using Beautiful Soup to find each critic together with the list of movies that immediately follows her/him. (Doing that with Beautiful Soup is challenging. I have instructions on how to figure it out, but if you can't, just email me and I will send you the code.)

Yes, that is how this process will work--below I have step-by-step instructions so you can try to write the code yourself. Do your best--and if you can't get there, email me and I will send you working code so you can move on to the next step.


### REMEMBER: secondary source
Part of this week's work is to find a source you can use to get the country of origin for each director. This is something you need to search for on your own. It will be hard to find a single page that lists every director, but see what you can find. In the end, you don't need a complete database of every single director, but do your best to get as many as you can.
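Once you have found a source, one way to attach countries to your movie data is a plain dictionary lookup. This is only a sketch: the `director_country` table and the sample records are typed by hand for illustration (they are not scraped from anywhere), and `add_country` is a hypothetical helper name.

```python
# Hypothetical, hand-typed lookup table; in practice you would fill this
# from whatever secondary source you find.
director_country = {
    "David Lynch": "US",
    "Terrence Malick": "US",
}

# Sample records in the same shape as the movie dictionaries built later on.
sample_movies = [
    {"movie_name": "Mulholland Drive", "movie_director": "David Lynch"},
    {"movie_name": "Fados", "movie_director": "Carlos Saura"},
]


def add_country(movies, lookup, missing="unknown"):
    """Return copies of the movie dicts with a 'director_country' key added."""
    result = []
    for movie in movies:
        enriched = dict(movie)  # copy, so the original record is left alone
        enriched["director_country"] = lookup.get(movie["movie_director"], missing)
        result.append(enriched)
    return result


movies_with_country = add_country(sample_movies, director_country)
print(movies_with_country)
```

Directors that are not in the lookup table get the `missing` label instead of raising an error, which fits the "get as many as you can" goal.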


### Getting started: Data Architecture
You can come up with your own data scheme for this, but the one I'm recommending is three separate lists:

The most challenging one is the **critics_list**:

`critics_list = [['critic name','critic organization','critic country','movie one name','movie two name','movie three name',etc],['critic name','critic organization','critic country','movie one','movie two','movie three',etc]]`

So each list would contain 13 elements -- three entries about the critic, and then the 10 movies picked. critics_list[0][3] would be the first critic's #1 movie, critics_list[2][12] would be the third critic's  #10 movie.

Next, you would make  **"movie_list"** which would look like this:

`movie_list = [['movie name','director name','movie date'],['movie name','director name','movie date']]`

Just go through the whole page and make a list of lists for every movie. Each list would contain three elements. movie_list[0][0] would give you the name of the first movie in the list, movie_list[3][1] would give you the director of the fourth movie in the list.

Finally, you would need to make a simple **director_list**:

`director_list = ['Director name','Director name']`

director_list[0] would give you the first director.
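To make the indexing concrete, here is a small sketch of the three structures with a few placeholder rows typed by hand from the first critic's ballot. The critic row is shortened to three movies to keep it readable; the real row would have all ten.

```python
# Placeholder data typed by hand; the real lists are filled by the scraping loop.
critics_list = [
    ["Simon Abrams", "Freelance film critic", "US",
     "Mulholland Drive", "In the Mood for Love", "The Tree of Life"],
]
movie_list = [
    ["Mulholland Drive", "David Lynch", "2001"],
    ["In the Mood for Love", "Wong Kar-wai", "2000"],
]
director_list = ["David Lynch", "Wong Kar-wai"]

# Index 0-2 of a critic row describe the critic; index 3 is the #1 movie.
first_pick = critics_list[0][3]
# Second movie in movie_list, second field (the director).
second_movie_director = movie_list[1][1]
first_director = director_list[0]

print(first_pick, "|", second_movie_director, "|", first_director)
```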


### Time for code: 

The first thing you need to do is import beautiful soup & urllib like we did in the homework, and scrape the page. http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted


One thing I should note: there are two inconsistencies (actual errors in the HTML) on the BBC page that will cause you to lose a couple of entries (which is okay, but may be frustrating). I have posted a version of the exact same page with those inconsistencies fixed, if you want to scrape from that page instead:

http://floatingmedia.com/columbia/BBC.html

It's up to you. Okay let's begin!

STEP 1:


In [174]:
##Import your libraries: Beautiful soup, urllib, and re (For regular expressions)
from urllib.request import urlopen
from bs4 import BeautifulSoup 
import re
import pandas as pd

In [175]:
# read the URL, and put the HTML page into beautiful soup
my_url = "http://floatingmedia.com/columbia/BBC.html"
raw_html = urlopen(my_url).read()
soup_doc = BeautifulSoup(raw_html, "html.parser")

In [176]:
#Using beautiful soup find the div tag that contains 
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information 
soup_doc.find_all('div')

[<div class="bbccom_display_none" id="bbccom_interstitial_ad"></div>,
 <div class="bbccom_display_none" id="bbccom_interstitial"><script type="text/javascript"> /*<![CDATA[*/ (function() { if (window.bbcdotcom && bbcdotcom.config.isActive('ads')) { googletag.cmd.push(function() { googletag.display('bbccom_interstitial'); }); } }()); /*]]>*/ </script></div>,
 <div class="bbccom_display_none" id="bbccom_wallpaper_ad"></div>,
 <div class="bbccom_display_none" id="bbccom_wallpaper"><script type="text/javascript"> /*<![CDATA[*/ (function() { var wallpaper; if (window.bbcdotcom && bbcdotcom.config.isActive('ads')) { if (bbcdotcom.config.isAsync()) { googletag.cmd.push(function() { googletag.display('bbccom_wallpaper'); }); } else { googletag.display("wallpaper"); } wallpaper = bbcdotcom.adverts.adRegister.getAd('wallpaper'); } }()); /*]]>*/ </script></div>,
 <div id="blq-global"> <div id="blq-pre-mast"> </div> </div>,
 <div id="blq-pre-mast"> </div>,
 <div class="orb-nav-pri orb-nav-pri-whit

In [177]:
info = soup_doc.find('div',class_='body-content')
info

<div class="body-content">
<p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.</p><p><strong>Simon Abrams – Freelance film critic (US)</strong></p><p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Ra

**STEP 2** Here is where it begins to get tricky: at this point everything we want is wrapped in `<p>` tags. Use a Beautiful Soup find_all to get a list of everything in a `<p>` tag. Make a variable that contains that list (you could call it all_p or something).


In [114]:
#find_all

In [178]:
all_p = info.find_all('p')
all_p

[<p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.</p>,
 <p><strong>Simon Abrams – Freelance film critic (US)</strong></p>,
 <p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Rasoulof, 2009)<br/>7.

In [116]:
all_p[1].strong.string

'Simon Abrams – Freelance film critic (US)'

**STEP THREE** This is where all the magic has to happen: you need to find a way to look through all of the `<p>` elements (loop through the list you just got from the find_all()) and pull out each critic and the list of movies that goes with them.

Critics should not be too hard--every critic entry is embedded in `<strong>` tags. But in order to get the movies attached to that critic--you need to find the `<p>` tag immediately following each `<p><strong>` -- you can do this using next_sibling.

So, you need to build a loop that searches through your `all_p` list:

if it has a `<strong>` tag then 
critic_info = p_line.strong.string
movie_info = p_line.next_sibling

As you go through this loop print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by movie line's HTML--you've got it!

I give you the beginning of the loop below, and then you can build it piece by piece. If you want to see the overall architecture of the final loop, I have a commented example at the end of the page, though it might not be helpful to look at it at this point. See how you do step-by-step, and if you get stuck at a step, email me with your code!
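One thing to watch for with `next_sibling`: it returns whatever node comes next, and that can be a whitespace string rather than a tag if the HTML has a line break between two `<p>` elements. The sketch below runs on a made-up snippet (`sample_html` is hand-written, not the real page) and uses Beautiful Soup's `find_next_sibling('p')`, which skips ahead to the next `<p>` tag. Treat it as an alternative to try if `next_sibling` gives you trouble, not as the required approach.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page, with a newline between the two <p> tags.
sample_html = (
    "<div class='body-content'>"
    "<p>Intro paragraph.</p>"
    "<p><strong>Simon Abrams – Freelance film critic (US)</strong></p>\n"
    "<p>1. Mulholland Drive (David Lynch, 2001)<br/>"
    "2. In the Mood for Love (Wong Kar-wai, 2000)</p>"
    "</div>"
)
soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for p_line in soup.find_all("p"):
    # Only paragraphs that contain <strong> are critic lines.
    if p_line.strong is not None:
        critic_info = p_line.strong.get_text()
        # Jump to the next <p> tag, ignoring any text nodes in between.
        movie_p = p_line.find_next_sibling("p")
        movie_info = movie_p.find_all(string=True)
        pairs.append((critic_info, movie_info))

for critic_info, movie_info in pairs:
    print(critic_info)
    print(movie_info)
```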



In [179]:
##Write your loop for STEP 3 here
#I started this for you.
#Because you only want it to search starting with each critic,
#   `if lines.strong is not None:` does that for you
for lines in all_p:
    if lines.strong is not None:
        critic_info = lines.strong.text
        movie_info = lines.next_sibling.find_all(string=True)
        print(critic_info)
        print(movie_info)

Simon Abrams – Freelance film critic (US)
['1. Mulholland Drive (David Lynch, 2001)', '2. In the Mood for Love (Wong Kar-wai, 2000)', '3. The Tree of Life (Terrence Malick, 2011)', '4. Yi Yi: A One and a Two (Edward Yang, 2000)', '5. Goodbye to Language (Jean-Luc Godard, 2014)', '6. The White Meadows (Mohammad Rasoulof, 2009)', '7. Night Across the Street (Raoul Ruiz, 2012)', '8. Certified Copy (Abbas Kiarostami, 2010)', '9. Sparrow (Johnnie To, 2008)', '10. Fados (Carlos Saura, 2007)']
Sam Adams – Freelance film critic (US)
['1. In the Mood for Love (Wong Kar-wai, 2000)', '2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)', '3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)', '4. Spirited Away (Hayao Miyazaki, 2001)', '5. The Act of Killing (Joshua Oppenheimer, 2012)', '6. The Grand Budapest Hotel (Wes Anderson, 2014)', '7. The New World (Terrence Malick, 2004)', '8. Certified Copy (Abbas Kiarostami, 2010)', '9. The World (Jia Zhangke, 2004)', '10. Elephant (Gu

**STEP 4**
If your loop is successfully isolating those two lines, it's time to parse each line with regular expressions. This needs to happen inside the loop: for every critic, and then (in STEP 5) for every movie. Here just **focus on getting the critic's name, organization, and country.**

Inside the loop, once you have critic_info, make a regular expression that pulls out the name of the critic and store the result in a variable called critic_name:

`critic_name = re.findall(regex,critic_info)`

Do the same thing for critic_org and critic_cn.

As you go, print(critic_name), then print(critic_org), etc., to make sure you're getting the results you expect. It might help, before you do all these regular expressions in a loop, to just grab one critic's line and test regular expressions on it. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [180]:
#Practice/Build your regular expressions here
crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r"^(.+)\W.\W"
regex_for_org = r"\W.\W(.+) "
regex_for_cn = r"\(.+"
name = re.findall(regex_for_name,crit_sample)
org = re.findall(regex_for_org,crit_sample)
cn = re.findall(regex_for_cn,crit_sample)
name[0]

'Arturo Aguilar'
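The practice patterns above work, but `regex_for_cn` returns the country with its parentheses attached (for example `'(Mexico)'`). As a sketch of an alternative, a single pattern with named groups can pull out all three fields at once. It assumes every critic line has the shape `name – organization (country)` with an en dash as the separator; `CRITIC_PATTERN` and `parse_critic` are names made up for this example.

```python
import re

# name, then an en dash surrounded by whitespace, then the organization,
# then the country inside the final pair of parentheses.
CRITIC_PATTERN = re.compile(
    r"^(?P<name>.+?)\s+–\s+(?P<org>.+)\s+\((?P<country>[^()]+)\)\s*$"
)


def parse_critic(line):
    """Return (name, org, country), or None if the line does not fit the shape."""
    match = CRITIC_PATTERN.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("org"), match.group("country")


parsed = parse_critic("Arturo Aguilar – Rolling Stone Mexico (Mexico)")
print(parsed)
```

Returning `None` for lines that do not match makes it easy to print the odd ones out and inspect them, rather than crashing on `[0]` of an empty list.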

In [181]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it
for lines in all_p:
    if lines.strong is not None:
        critic_info = lines.strong.text
        regex_for_name = r"^(.+)\W.\W"
        regex_for_org = r"\W.\W(.+) "
        regex_for_cn = r"\(.+"
        name = re.findall(regex_for_name,critic_info)
        org = re.findall(regex_for_org,critic_info)
        cn = re.findall(regex_for_cn,critic_info)
        print(critic_info)

Simon Abrams – Freelance film critic (US)
Sam Adams – Freelance film critic (US)
Thelma Adams – Freelance film critic (US)
Arturo Aguilar – Rolling Stone Mexico (Mexico)
Matthew Anderson – BBC Culture (UK)
Tim Appelo – The Wrap (US)
Adriano Aprà – Film historian (Italy)
Michael Arbeiter – Nerdist (US)
Ali Arikan – Dipnot TV (Turkey)
Michael Atkinson – The Village Voice (US)
Ana Maria Bahiana – Freelance film critic (Brazil)
Cameron Bailey – Toronto Film Festival (Canada)
Lindsay Baker – BBC Culture (UK)
Miriam Bale – Freelance film critic (US)
Nicholas Barber – BBC Culture (UK)
Diego Batlle – La Nacion (Argentina)
NT Binh – Positif (France)
Lizelle Bisschoff – University of Glasgow (UK)
Christian Blauvelt – BBC Culture (US)
Mahen Bonetti – African Film Festival Inc (US)
Andreas Borcholte – Spiegel Online (Germany)
Utpal Borpujari – Freelance film critic (India)
Richard Brody – The New Yorker (US)
Hannah Brown – Jerusalem Post (Israel)
Luke Buckmaster – The Guardian/BBC Culture (Austral

**STEP 5**
Now you need to get your **movie names**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable, which is each movie followed by a `<br/>` tag. I showed you this in class, but here it is again. To get a list of everything that is not a `<br/>` tag, use this method:

`each_movie = movie_info.find_all(string=True)`

This will give you a list called `each_movie`, which contains one string for each movie, like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop that goes through each movie and prints it out.


In [182]:
##Take your working loop and add the find_all for each_movie
#and the inner loop that loops through each_movie
for lines in all_p:
    if lines.strong is not None:
        movie_info = lines.next_sibling.find_all(string=True)
        print(movie_info)

['1. Mulholland Drive (David Lynch, 2001)', '2. In the Mood for Love (Wong Kar-wai, 2000)', '3. The Tree of Life (Terrence Malick, 2011)', '4. Yi Yi: A One and a Two (Edward Yang, 2000)', '5. Goodbye to Language (Jean-Luc Godard, 2014)', '6. The White Meadows (Mohammad Rasoulof, 2009)', '7. Night Across the Street (Raoul Ruiz, 2012)', '8. Certified Copy (Abbas Kiarostami, 2010)', '9. Sparrow (Johnnie To, 2008)', '10. Fados (Carlos Saura, 2007)']
['1. In the Mood for Love (Wong Kar-wai, 2000)', '2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)', '3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)', '4. Spirited Away (Hayao Miyazaki, 2001)', '5. The Act of Killing (Joshua Oppenheimer, 2012)', '6. The Grand Budapest Hotel (Wes Anderson, 2014)', '7. The New World (Terrence Malick, 2004)', '8. Certified Copy (Abbas Kiarostami, 2010)', '9. The World (Jia Zhangke, 2004)', '10. Elephant (Gus Van Sant, 2003)']
['1. Zero Dark Thirty (Kathryn Bigelow, 2012)', '2. A History

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.


In [183]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_mname = r"^\d.\W(.*)\W\("
movie_name = re.findall(regex_for_mname,movie_sample)
movie_name[0]

'Zero Dark Thirty'
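It is worth testing the pattern on a two-digit rank as well. With `r"^\d.\W(.*)\W\("`, the `.` consumes the second digit of `10` and `\W` consumes the period, so the title comes back with a leading space. Similarly, the loose year pattern `r"\d.\d."` used later can match a numeric title before it reaches the year. The sketch below shows both effects and one possible stricter pattern; `MOVIE_PATTERN` and `parse_movie` are names invented here, and the `2046` line uses a placeholder director and year purely to illustrate the problem.

```python
import re

# rank, a literal period, the title, then "(director, 4-digit year)" at the end.
MOVIE_PATTERN = re.compile(
    r"^(?P<rank>\d+)\.\s*(?P<title>.+)\s+"
    r"\((?P<director>[^()]+),\s*(?P<year>\d{4})\)\s*$"
)


def parse_movie(line):
    """Return (rank, title, director, year), or None if the line does not fit."""
    match = MOVIE_PATTERN.match(line)
    if match is None:
        return None
    return (int(match.group("rank")), match.group("title"),
            match.group("director"), match.group("year"))


samples = [
    "1. Zero Dark Thirty (Kathryn Bigelow, 2012)",
    "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)",
    "10. Fados (Carlos Saura, 2007)",
]
parsed_movies = [parse_movie(s) for s in samples]

# The practice pattern on a two-digit rank keeps the space before the title.
old_title = re.findall(r"^\d.\W(.*)\W\(", samples[2])

# Made-up line (placeholder director and year): the loose year pattern
# matches the numeric title first.
old_year = re.findall(r"\d.\d.", "1. 2046 (Some Director, 2004)")

for row in parsed_movies:
    print(row)
print(old_title, old_year)
```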

**STEP 6**
You're almost there!!! Now that you have a working regular expression, put it in your inner loop to get the movie name.

So now the entire loop should be getting you 13 elements:
critic_name
critic_org
critic_cn

And an inner loop that will run 10 times (for the 10 movies) and give you 10 instances of:
movie_name


In [184]:
#Get that loop working here
critic_list = []
for lines in all_p[:-3]:
    if lines.strong is not None:
        critic_info = lines.strong.text
        critic_dic = {}
        regex_for_name = r"^(.+)\W.\W"
        regex_for_org = r"\W.\W(.+) "
        regex_for_cn = r"\(.+"
        name = re.findall(regex_for_name,critic_info)
        org = re.findall(regex_for_org,critic_info)
        cn = re.findall(regex_for_cn,critic_info)
        critic_dic['critic_name'] = name[0]
        critic_dic['critic_org'] = org[0]
        critic_dic['critic_cn'] = cn[0]
        # only add a critic that is not already in the list
        if critic_dic not in critic_list:
            critic_list.append(critic_dic)

In [185]:
critic_list

[{'critic_cn': '(US)',
  'critic_name': 'Simon Abrams',
  'critic_org': 'Freelance film critic'},
 {'critic_cn': '(US)',
  'critic_name': 'Sam Adams',
  'critic_org': 'Freelance film critic'},
 {'critic_cn': '(US)',
  'critic_name': 'Thelma Adams',
  'critic_org': 'Freelance film critic'},
 {'critic_cn': '(Mexico)',
  'critic_name': 'Arturo Aguilar',
  'critic_org': 'Rolling Stone Mexico'},
 {'critic_cn': '(UK)',
  'critic_name': 'Matthew Anderson',
  'critic_org': 'BBC Culture'},
 {'critic_cn': '(US)', 'critic_name': 'Tim Appelo', 'critic_org': 'The Wrap'},
 {'critic_cn': '(Italy)',
  'critic_name': 'Adriano Aprà',
  'critic_org': 'Film historian'},
 {'critic_cn': '(US)',
  'critic_name': 'Michael Arbeiter',
  'critic_org': 'Nerdist'},
 {'critic_cn': '(Turkey)',
  'critic_name': 'Ali Arikan',
  'critic_org': 'Dipnot TV'},
 {'critic_cn': '(US)',
  'critic_name': 'Michael Atkinson',
  'critic_org': 'The Village Voice'},
 {'critic_cn': '(Brazil)',
  'critic_name': 'Ana Maria Bahiana',
  

In [186]:
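movie_list = []
# The patterns never change, so define them once, before the loops.
# Title: for "10." the `.` consumes the 0 and `\W` the period,
# so the #10 title keeps a leading space.
regex_for_mname = r"^\d.\W(.*)\W\("
# Director: everything between the first "(" and the last ","
regex_for_dname = r"\((.*)\,"
# Year: first run of digit, any character, digit, any character
regex_for_year = r"\d.\d."
for lines in all_p[:-3]:
    if lines.strong is not None:
        # the <p> right after a critic holds that critic's ten movies
        movie_info = lines.next_sibling.find_all(string=True)
        for movie in movie_info:
            movie_dic = {}
            movie_name = re.findall(regex_for_mname,movie)
            movie_director = re.findall(regex_for_dname,movie)
            movie_year = re.findall(regex_for_year,movie)
            movie_dic['movie_name'] = movie_name[0]
            movie_dic['movie_director'] = movie_director[0]
            movie_dic['movie_year'] = movie_year[0]
            movie_list.append(movie_dic)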

In [187]:
len(movie_list)

1770

In [188]:
movie_list

[{'movie_director': 'David Lynch',
  'movie_name': 'Mulholland Drive',
  'movie_year': '2001'},
 {'movie_director': 'Wong Kar-wai',
  'movie_name': 'In the Mood for Love',
  'movie_year': '2000'},
 {'movie_director': 'Terrence Malick',
  'movie_name': 'The Tree of Life',
  'movie_year': '2011'},
 {'movie_director': 'Edward Yang',
  'movie_name': 'Yi Yi: A One and a Two',
  'movie_year': '2000'},
 {'movie_director': 'Jean-Luc Godard',
  'movie_name': 'Goodbye to Language',
  'movie_year': '2014'},
 {'movie_director': 'Mohammad Rasoulof',
  'movie_name': 'The White Meadows',
  'movie_year': '2009'},
 {'movie_director': 'Raoul Ruiz',
  'movie_name': 'Night Across the Street',
  'movie_year': '2012'},
 {'movie_director': 'Abbas Kiarostami',
  'movie_name': 'Certified Copy',
  'movie_year': '2010'},
 {'movie_director': 'Johnnie To',
  'movie_name': 'Sparrow',
  'movie_year': '2008'},
 {'movie_director': 'Carlos Saura',
  'movie_name': ' Fados',
  'movie_year': '2007'},
 {'movie_director': '

In [189]:
df_m = pd.DataFrame(movie_list)
df_m.to_csv('movie_list.csv', index=False)
df_m = pd.read_csv('movie_list.csv')
#df.drop_duplicates(subset='movie_name', keep='first', inplace=True)

In [190]:
df_m

Unnamed: 0,movie_director,movie_name,movie_year
0,David Lynch,Mulholland Drive,2001
1,Wong Kar-wai,In the Mood for Love,2000
2,Terrence Malick,The Tree of Life,2011
3,Edward Yang,Yi Yi: A One and a Two,2000
4,Jean-Luc Godard,Goodbye to Language,2014
5,Mohammad Rasoulof,The White Meadows,2009
6,Raoul Ruiz,Night Across the Street,2012
7,Abbas Kiarostami,Certified Copy,2010
8,Johnnie To,Sparrow,2008
9,Carlos Saura,Fados,2007


In [191]:
df_m.drop_duplicates(subset=None, keep='first', inplace=True)
df_m

Unnamed: 0,movie_director,movie_name,movie_year
0,David Lynch,Mulholland Drive,2001
1,Wong Kar-wai,In the Mood for Love,2000
2,Terrence Malick,The Tree of Life,2011
3,Edward Yang,Yi Yi: A One and a Two,2000
4,Jean-Luc Godard,Goodbye to Language,2014
5,Mohammad Rasoulof,The White Meadows,2009
6,Raoul Ruiz,Night Across the Street,2012
7,Abbas Kiarostami,Certified Copy,2010
8,Johnnie To,Sparrow,2008
9,Carlos Saura,Fados,2007
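`drop_duplicates` removes rows that repeat exactly. If you want the same effect on the plain list of dictionaries before it goes into pandas, one possible sketch is to remember which (name, director, year) combinations you have already seen. `unique_movies` is a hypothetical helper, and the sample records are copied by hand from the first two ballots shown above.

```python
def unique_movies(movies):
    """Return the movie dicts with exact repeats removed, keeping first-seen order."""
    seen = set()
    unique = []
    for movie in movies:
        # Tuples are hashable, so they can be stored in a set.
        key = (movie["movie_name"], movie["movie_director"], movie["movie_year"])
        if key not in seen:
            seen.add(key)
            unique.append(movie)
    return unique


sample = [
    {"movie_name": "Certified Copy", "movie_director": "Abbas Kiarostami", "movie_year": "2010"},
    {"movie_name": "Sparrow", "movie_director": "Johnnie To", "movie_year": "2008"},
    {"movie_name": "Certified Copy", "movie_director": "Abbas Kiarostami", "movie_year": "2010"},
]
deduped = unique_movies(sample)
print(len(sample), "->", len(deduped))
```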


**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this let me know and we can discuss what to do next. If you've made it just following instructions, you are in great shape for the rest of this project--if not, don't worry! I will get you through by midweek.

The final step is building a list of lists that contains the 13 elements: 3 things about the critic and the 10 movies she/he selected.

In the cell below, I give you a final architecture you need to use to get this most challenging list of lists.

In [130]:
#critics_list = []
#for loop that goes through all the <p> elements
#    if strong (begins with the critic)
#        this_critic = []
#        critic_info = get the critic line
#        critic_name = re.findall(regex,critic_info)
#        critic_org = re.findall(regex,critic_info)
#        critic_cn = re.findall(regex,critic_info)
#        this_critic.extend([critic_name[0],critic_org[0],critic_cn[0]])
#        movie_info = get movie line using next_sibling
#        get each movie string
#        loop through each movie_line (#1 through #10)
#            movie_name = re.findall(regex,movie_line)
#            this_critic.append(movie_name[0])
#            this append will happen 10 times
#        The list for the single critic is finished
#        Add it to the critics list
#        critics_list.append(this_critic)
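As a rough, runnable illustration of that architecture, the sketch below builds one 13-element row from strings that have already been pulled out of the page, so it needs no scraping. The patterns are one possible choice rather than the required ones, and `build_critic_row`, `CRITIC_RE`, and `TITLE_RE` are names invented for this example. The input strings are copied from the first ballot printed earlier.

```python
import re

# name – organization (country)
CRITIC_RE = re.compile(r"^(.+?)\s+–\s+(.+)\s+\(([^()]+)\)\s*$")
# rank. title (director, year) -> capture only the title
TITLE_RE = re.compile(r"^\d+\.\s*(.+)\s+\([^()]+\)\s*$")


def build_critic_row(critic_info, movie_lines):
    """Return [name, org, country, movie 1, ..., movie 10] for one critic."""
    # .match() returns None on an unexpected line, which would raise here.
    name, org, country = CRITIC_RE.match(critic_info).groups()
    this_critic = [name, org, country]
    for movie_line in movie_lines:
        this_critic.append(TITLE_RE.match(movie_line).group(1))
    return this_critic


critic_info = "Simon Abrams – Freelance film critic (US)"
movie_lines = [
    "1. Mulholland Drive (David Lynch, 2001)",
    "2. In the Mood for Love (Wong Kar-wai, 2000)",
    "3. The Tree of Life (Terrence Malick, 2011)",
    "4. Yi Yi: A One and a Two (Edward Yang, 2000)",
    "5. Goodbye to Language (Jean-Luc Godard, 2014)",
    "6. The White Meadows (Mohammad Rasoulof, 2009)",
    "7. Night Across the Street (Raoul Ruiz, 2012)",
    "8. Certified Copy (Abbas Kiarostami, 2010)",
    "9. Sparrow (Johnnie To, 2008)",
    "10. Fados (Carlos Saura, 2007)",
]

critics_list = [build_critic_row(critic_info, movie_lines)]
print(critics_list[0])
```

In the real loop, `critic_info` and `movie_lines` would come from `lines.strong.text` and `lines.next_sibling.find_all(string=True)`.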
            

    

In [192]:
critic_list = []
for lines in all_p[:-3]:
    if lines.strong is not None:
        critic_info = lines.strong.text
        critic_dic = {}
        regex_for_name = r"^(.+)\W.\W"
        regex_for_org = r"\W.\W(.+) "
        regex_for_cn = r"\(.+"
        name = re.findall(regex_for_name,critic_info)
        org = re.findall(regex_for_org,critic_info)
        cn = re.findall(regex_for_cn,critic_info)
        critic_dic['critic_name'] = name[0]
        critic_dic['critic_org'] = org[0]
        critic_dic['critic_cn'] = cn[0]
        movie_info = lines.next_sibling.find_all(string=True)
        count = 1
        regex_for_mname = r"^\d*.\W(.*)\W\("
        for movie in movie_info:
            movie_name = re.findall(regex_for_mname,movie)
            critic_dic[count] = movie_name[0]
            count += 1
        critic_list.append(critic_dic)

In [193]:
##Take a peek at your final lists of lists
critic_list

[{'critic_name': 'Simon Abrams',
  'critic_org': 'Freelance film critic',
  'critic_cn': '(US)',
  1: 'Mulholland Drive',
  2: 'In the Mood for Love',
  3: 'The Tree of Life',
  4: 'Yi Yi: A One and a Two',
  5: 'Goodbye to Language',
  6: 'The White Meadows',
  7: 'Night Across the Street',
  8: 'Certified Copy',
  9: 'Sparrow',
  10: 'Fados'},
 {'critic_name': 'Sam Adams',
  'critic_org': 'Freelance film critic',
  'critic_cn': '(US)',
  1: 'In the Mood for Love',
  2: 'Eternal Sunshine of the Spotless Mind',
  3: 'Syndromes and a Century',
  4: 'Spirited Away',
  5: 'The Act of Killing',
  6: 'The Grand Budapest Hotel',
  7: 'The New World',
  8: 'Certified Copy',
  9: 'The World',
  10: 'Elephant'},
 {'critic_name': 'Thelma Adams',
  'critic_org': 'Freelance film critic',
  'critic_cn': '(US)',
  1: 'Zero Dark Thirty',
  2: 'A History of Violence',
  3: 'The Grand Budapest Hotel',
  4: 'Stories We Tell',
  5: 'Casino Royale',
  6: 'Eternal Sunshine of the Spotless Mind',
  7: 'Tabu

In [194]:
df = pd.DataFrame(critic_list)
df.to_csv('critic_list.csv', index=False)
pd.read_csv('critic_list.csv')

Unnamed: 0,critic_cn,critic_name,critic_org,1,2,3,4,5,6,7,8,9,10
0,(US),Simon Abrams,Freelance film critic,Mulholland Drive,In the Mood for Love,The Tree of Life,Yi Yi: A One and a Two,Goodbye to Language,The White Meadows,Night Across the Street,Certified Copy,Sparrow,Fados
1,(US),Sam Adams,Freelance film critic,In the Mood for Love,Eternal Sunshine of the Spotless Mind,Syndromes and a Century,Spirited Away,The Act of Killing,The Grand Budapest Hotel,The New World,Certified Copy,The World,Elephant
2,(US),Thelma Adams,Freelance film critic,Zero Dark Thirty,A History of Violence,The Grand Budapest Hotel,Stories We Tell,Casino Royale,Eternal Sunshine of the Spotless Mind,Tabu,Snow White,Frozen River,Gosford Park
3,(Mexico),Arturo Aguilar,Rolling Stone Mexico,In the Mood for Love,Mulholland Drive,Inception,Pan's Labyrinth,Caché,Grizzly Man,"4 Months, 3 Weeks & 2 Days",Holy Motors,The Last of the Unjust,There Will Be Blood
4,(UK),Matthew Anderson,BBC Culture,The Piano Teacher,Margaret,American Psycho,"4 Months, 3 Weeks & 2 Days",Caché,Mulholland Drive,Lourdes,Red Road,Boyhood,Tony Manero
5,(US),Tim Appelo,The Wrap,No Country For Old Men,Spirited Away,A Separation,Pan's Labyrinth,Finding Nemo,Hero,The Wolf of Wall Street,Mother,The Bourne Ultimatum,Traffic
6,(Italy),Adriano Aprà,Film historian,These Encounters of Theirs,Vincere,Le quattro volte,The Profession of Arms,Gostanza da Libbiano,Storia di una donna amata e di un assassino ge...,At the First Breath of Wind,Sangue,Terra,Oh! Man
7,(US),Michael Arbeiter,Nerdist,"Synecdoche, New York",Mulholland Drive,Eternal Sunshine of the Spotless Mind,Black Swan,The Comedy,Melancholia,Inglourious Basterds,Inside Llewyn Davis,Only Lovers Left Alive,The Congress
8,(Turkey),Ali Arikan,Dipnot TV,The Master,25th Hour,24 Hour Party People,Mulholland Drive,Once Upon a Time in Anatolia,Zodiac,A Serious Man,Tinker Tailor Soldier Spy,Primer,In the Mood for Love
9,(US),Michael Atkinson,The Village Voice,"La commune (Paris, 1871)",Uncle Boonmee Who Can Recall His Past Lives,2046,Adaptation,Battle in Heaven,Caché,Inland Empire,Mulholland Drive,Mysteries of Lisbon,Eternal Sunshine of the Spotless Mind


In [195]:
director_list = []
regex_for_dname = r"\((.*)\,"
for lines in all_p[:-3]:
    if lines.strong is not None:
        movie_info = lines.next_sibling.find_all(string=True)
        for movie in movie_info:
            movie_director = re.findall(regex_for_dname,movie)
            # keep each director name only once
            if movie_director[0] not in director_list:
                director_list.append(movie_director[0])

In [196]:
movies_list = []
for lines in all_p[:-3]:
    if lines.strong is not None:
        critic_info = lines.strong.text
        critic_dic = {}
        regex_for_name = r"^(.+)\W.\W"
        regex_for_org = r"\W.\W(.+) "
        regex_for_cn = r"\(.+"
        name = re.findall(regex_for_name,critic_info)
        org = re.findall(regex_for_org,critic_info)
        cn = re.findall(regex_for_cn,critic_info)
        movie_info = lines.next_sibling.find_all(string=True)
        count = 1
        for movie in movie_info:
            movie_dic = {}
            regex_for_mname = r"^\d+.\W(.*)\W\("
            regex_for_dname = r"\((.*)\,"
            regex_for_year = r"\d.\d."
            movie_name = re.findall(regex_for_mname,movie)
            movie_director = re.findall(regex_for_dname,movie)
            movie_year = re.findall(regex_for_year,movie)
            movie_dic['critic_name'] = name[0]
            movie_dic['critic_org'] = org[0]
            movie_dic['critic_cn'] = cn[0]
            movie_dic['movie_name'] = movie_name[0]
            movie_dic['movie_director'] = movie_director[0]
            movie_dic['movie_year'] = movie_year[0]
            movie_dic['count'] = count
            count += 1
            movies_list.append(movie_dic)

In [197]:
movies_list

[{'count': 1,
  'critic_cn': '(US)',
  'critic_name': 'Simon Abrams',
  'critic_org': 'Freelance film critic',
  'movie_director': 'David Lynch',
  'movie_name': 'Mulholland Drive',
  'movie_year': '2001'},
 {'count': 2,
  'critic_cn': '(US)',
  'critic_name': 'Simon Abrams',
  'critic_org': 'Freelance film critic',
  'movie_director': 'Wong Kar-wai',
  'movie_name': 'In the Mood for Love',
  'movie_year': '2000'},
 {'count': 3,
  'critic_cn': '(US)',
  'critic_name': 'Simon Abrams',
  'critic_org': 'Freelance film critic',
  'movie_director': 'Terrence Malick',
  'movie_name': 'The Tree of Life',
  'movie_year': '2011'},
 {'count': 4,
  'critic_cn': '(US)',
  'critic_name': 'Simon Abrams',
  'critic_org': 'Freelance film critic',
  'movie_director': 'Edward Yang',
  'movie_name': 'Yi Yi: A One and a Two',
  'movie_year': '2000'},
 {'count': 5,
  'critic_cn': '(US)',
  'critic_name': 'Simon Abrams',
  'critic_org': 'Freelance film critic',
  'movie_director': 'Jean-Luc Godard',
  'movi

In [198]:
import pandas as pd

In [199]:
df = pd.DataFrame(movies_list)
df.to_csv('movies_list.csv', index=False)

In [200]:
df

Unnamed: 0,count,critic_cn,critic_name,critic_org,movie_director,movie_name,movie_year
0,1,(US),Simon Abrams,Freelance film critic,David Lynch,Mulholland Drive,2001
1,2,(US),Simon Abrams,Freelance film critic,Wong Kar-wai,In the Mood for Love,2000
2,3,(US),Simon Abrams,Freelance film critic,Terrence Malick,The Tree of Life,2011
3,4,(US),Simon Abrams,Freelance film critic,Edward Yang,Yi Yi: A One and a Two,2000
4,5,(US),Simon Abrams,Freelance film critic,Jean-Luc Godard,Goodbye to Language,2014
5,6,(US),Simon Abrams,Freelance film critic,Mohammad Rasoulof,The White Meadows,2009
6,7,(US),Simon Abrams,Freelance film critic,Raoul Ruiz,Night Across the Street,2012
7,8,(US),Simon Abrams,Freelance film critic,Abbas Kiarostami,Certified Copy,2010
8,9,(US),Simon Abrams,Freelance film critic,Johnnie To,Sparrow,2008
9,10,(US),Simon Abrams,Freelance film critic,Carlos Saura,Fados,2007


In [204]:
df['movie_name'].value_counts().head()

In the Mood for Love    49
Mulholland Drive        47
There Will Be Blood     35
Spirited Away           34
Boyhood                 30
Name: movie_name, dtype: int64
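`value_counts()` treats every appearance equally, whether a movie was a critic's #1 or #10. If you want to experiment with rank as well, here is a small sketch using `collections.Counter`. The scoring rule (11 minus the rank, so a #1 pick is worth 10 points) is just an assumption made for illustration, not part of the assignment, and the sample records are a hand-copied subset in the same shape as `movies_list`.

```python
from collections import Counter

# Hand-copied subset of records; 'count' is the rank on that critic's ballot.
sample_picks = [
    {"movie_name": "Mulholland Drive", "count": 1},
    {"movie_name": "In the Mood for Love", "count": 2},
    {"movie_name": "In the Mood for Love", "count": 1},
    {"movie_name": "Certified Copy", "count": 8},
    {"movie_name": "Certified Copy", "count": 8},
]

# Plain appearance counts, the same idea as value_counts().
appearances = Counter(pick["movie_name"] for pick in sample_picks)

# Rank-weighted points: assumed rule, a #1 pick earns 10 and a #10 pick earns 1.
points = Counter()
for pick in sample_picks:
    points[pick["movie_name"]] += 11 - pick["count"]

print(appearances.most_common())
print(points.most_common())
```

In this subset two movies tie on appearances, and the weighted score separates them.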

In [140]:
array_movie = df['movie_name'].unique()
array_movie.sort()
array_movie

array(['12 Years a Slave', '2046', '24 Hour Party People', '25th Hour',
       '3-Iron', '35 Shots of Rum', '4 Months, 3 Weeks & 2 Days',
       '5 Broken Cameras', '5x2', '678', '7 Letters', '99 Homes',
       'A Borrowed Identity', 'A Commuter’s Life (What a Life!)',
       'A Girl Walks Home Alone at Night', 'A History of Violence',
       'A Letter to Nelson Mandela',
       'A Pigeon Sat on a Branch Reflecting on Existence', 'A Prophet',
       'A Separation', 'A Serious Man', 'A Single Man',
       'A Tale of Two Sisters', 'A Time for Drunken Horses',
       'A Touch of Sin', 'A Vingança de Uma Mulher',
       'AI: Artificial Intelligence', 'About Elly', 'Actress',
       'Adaptation', 'Adventureland', 'After the Wedding', 'Ajami',
       'Alexandre Sokurov', 'All Divided Selves', "Almayer's Folly",
       'Almost Famous', 'Alpha Dog', 'American Hustle', 'American Psycho',
       'American Splendor', 'Amores Perros', 'Amour', 'Amélie',
       'An Injury to One', 'Anchorman: The L

In [141]:
len(array_movie)

598

In [142]:
array = df['movie_director'].unique()
array.sort()
array

array(['Abbas Kiarostami', 'Abdellatif Kechiche', 'Abderrahmane Sissako',
       'Adam Curtis', 'Adam McKay', 'Agnieszka Holland', 'Agnès Jaoui',
       'Agnès Varda', 'Aki Kaurismäki', 'Alain Cavalier', 'Alain Gomis',
       'Alain Guiraudie', 'Alain Resnais', 'Albert Serra',
       'Alejandro González Iñárritu', 'Aleksandr Sokurov',
       'Aleksey Fedorchenko', 'Aleksey German', 'Alex Garland',
       'Alexander Payne', 'Alfonso Cuarón', 'Amma Asante',
       'Ana Lily Amirpour', 'Andrea Arnold',
       'Andrew Adamson and Vicky Jenson', 'Andrew Dominik',
       'Andrew Dosunmu', 'Andrew Haigh', 'Andrew Lau and Alan Mak',
       'Andrew Stanton', 'Andrew Stanton and Lee Unkrich',
       'Andrey Zvyagintsev', 'Andrzej Wajda', 'Andrzej Zulawski',
       'André Singer', 'Ang Lee', 'Annemarie Jacir',
       'Anthony and Joe Russo', 'Anurag Kashyap',
       'Apichatpong Weerasethakul', 'Ari Folman', 'Arnaud Desplechin',
       'Asghar Farhadi', 'Ashutosh Gowariker', 'Asif Kapadia',
     

If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own. They use similar logic, but they are not nearly as complicated as this one.
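
One way to prepare director names for searching is sketched here. It is only an illustration under stated assumptions: the helper names `split_codirectors` and `partition_by_ascii` are hypothetical, joint credits are assumed to be joined by " and ", and names containing accented characters are assumed to be set aside for separate handling.

```python
def split_codirectors(credit):
    """Split a joint credit such as 'Andrew Lau and Alan Mak' into single names."""
    return [name.strip() for name in credit.split(" and ")]


def partition_by_ascii(credits):
    """Return (plain, special): names made only of ASCII characters, and the rest."""
    plain, special = [], []
    for credit in credits:
        for name in split_codirectors(credit):
            target = plain if name.isascii() else special
            if name not in target:  # skip repeats so each name is searched once
                target.append(name)
    return plain, special


# A few credits taken from the director array printed above
sample = ['Abbas Kiarostami', 'Agnès Varda',
          'Andrew Lau and Alan Mak', 'Anthony and Joe Russo']
plain, special = partition_by_ascii(sample)
print(plain)
print(special)
```

Shared-surname credits such as 'Anthony and Joe Russo' come out as 'Anthony' and 'Joe Russo', so entries like that still need to be fixed by hand.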

In [143]:
nonspecial_directors = ['Abbas Kiarostami', 'Abdellatif Kechiche', 'Abderrahmane Sissako',
       'Adam Curtis', 'Adam McKay', 'Agnieszka Holland', 'Alain Cavalier', 'Alain Gomis',
       'Alain Guiraudie', 'Alain Resnais', 'Albert Serra', 'Aleksandr Sokurov',
       'Aleksey Fedorchenko', 'Aleksey German', 'Alex Garland',
       'Alexander Payne', 'Amma Asante', 'Andrea Arnold',
       'Andrew Adamson', 'Andrew Dominik', 'Andrew Lau','Andrew Stanton', 
       'Andrey Zvyagintsev', 'Andrzej Wajda', 'Andrzej Zulawski',
       'Ang Lee', 'Annemarie Jacir', 'Joe Russo', 'Anurag Kashyap',
       'Apichatpong Weerasethakul', 'Ari Folman', 'Arnaud Desplechin',
       'Asghar Farhadi', 'Ashutosh Gowariker', 'Asif Kapadia',
       'Ava DuVernay', 'Avi Nesher', 'Bahman Ghobadi', 
       'Baz Luhrmann', 'Ben Rivers', 'Ben Stiller',
       'Ben Wheatley', 'Benh Zeitlin', 'Bong Joon-ho','Boo Junfeng',
       'Bouli Lanners', 'Brad Bird', 'Jan Pinkava',
       'Brian De Palma', 'Cameron Crowe',
       'Carlos Reygadas', 'Carlos Saura', 'Carol Morley',
       'Cary Joji Fukunaga', 'Catherine Breillat', 
       'Chantal Akerman', 'Charles Burnett', 'Charlie Kaufman',
       'Chris Buck', 'Christian Petzold', 'Christopher Guest', 'Christopher Morris',
       'Christopher Nolan', 'Claire Denis', 'Claude Lanzmann',
       'Clint Eastwood', 'Corneliu Porumboiu',
       'Courtney Hunt', 'Cristi Puiu', 'Cristian Mungiu', 
       'Damien Chazelle', 'Dan Gilroy', 'Danis Tanovic',
       'Jean-Marie Straub', 'Darren Aronofsky',
       'David Cronenberg', 'David Fincher', 'David Lynch',
        "David O'Reilly", 'David O. Russell', 'David Wain',
       'Debra Granik', 'Denis Villeneuve', 'Derek Cianfrance',
       'Destin Daniel Cretton', 'Dibakar Banerjee', 'Don Hertzfeldt',
       'Duke Johnson', 'Duncan Jones',
       'Edgar Wright', 'Edward Yang', 'Elia Suleiman',
       'Eran Riklis', 'Ermanno Olmi', 'Eytan Fox', 
       'Fatih Akin', 'Fernando Meirelles',
       'Florian Henckel von Donnersmarck', 'Francis Ford Coppola',
       'Franco Piavoli', 'Frederick Wiseman',
       'George Clooney', 'George Lucas', 'George Miller',
       'Gerhard Benedikt Friedl', 'Gina Prince-Bythewood', 'Greg Mottola',
       'Guillermo Del Toro', 'Gurinder Chadha', 'Gus Van Sant',
       'Guy Maddin', 'Hany Abu-Assad', 'Harmony Korine', 'Hayao Miyazaki',
       'Hirokazu Koreeda', 'Hong Sang-soo', 'Hou Hsiao-hsien',
       'Ingmar Bergman', 'J. A. Bayona', 'J. J. Abrams', 
       'Jacques Audiard', 'Jacques Rivette', 'Jafar Panahi',
       'James Cameron', 'James Gray',  'James Marsh',
       'Jane Campion', 'Janie Geiser', 'Jason Reitman',
       'Jean-Charles Fitoussi', 'Jean-Claude Brisseau',
       'Jean-Luc Godard', 
       'Jean-Pierre Jeunet',
       'Jean-Pierre and Luc Dardenne', 'Jeff Nichols', 
       'Jessica Hausner', 'Zhangke Jia', 'Jiang Wen',
       'Jim Jarmusch', 'Joanna Hogg', 'Joe Cornish',
       'John Akomfrah', 'John Cameron Mitchell',
       'John Carney', 'John Crowley', 'Johnnie To',
       'Jonas Mekas', 'Jonathan Glazer', 'Joseph Cedar', 'Joseph Kahn',
       'Joshua Oppenheimer', 'Joss Whedon', 'Julian Schnabel', 'Lars von Trier',
       'Marcelo Gomes', 'Kathryn Bigelow', 'Ken Jacobs', 'Ken Loach', 'Kenneth Lonergan',
       'Kevin Costner', 'Kim Jee-woon', 'Kim Ki-duk',
       'Kinji Fukasaku', 'Kiyoshi Kurosawa', 
       'Larry Charles', 'Lars von Trier', 'Laurent Cantet',
       'Laurie Anderson', 'Lav Diaz', 'Lee Chang-dong', 'Lee Unkrich',
       'Leos Carax', 'Luca Guadagnino', 'Lucrecia Martel',
       'Luigi M. Faccini', 'Lukas Moodysson', 
       'Lynne Ramsay', 'Mahamat-Saleh Haroun', 'Manoel de Oliveira', 'Marco Bellocchio',
       'Maren Ade', 'Mark Neveldine',
       'Martin Campbell', 'Martin Scorsese', 'Mary Harron', 'Mel Gibson',
       'Michael Bay', 'Michael Haneke', 'Michael Mann',
       'Michael Moore', 'Michael Winterbottom', 'Michel Gondry',
       'Michel Hazanavicius', 'Michelangelo Antonioni',
       'Michelangelo Frammartino', 'Miguel Arteta', 'Miguel Gomes',
       'Mike Judge', 'Mike Leigh', 'Mike Mills', 'Mira Nair',
       'Miranda July', 'Mohammad Rasoulof', 'Nadine Labaki', 'Nanni Moretti',
       'Naomi Kawase', 'Neill Blomkamp', 'Newton I. Aduaka',
       'Ngozi Onwurah', 'Nick Cassavetes', 'Nicolas Winding Refn',
       'Nikolaus Geyrhalter', 'Nina Paley', 'Noah Baumbach',
       'Nuri Bilge Ceylan', 'Oliver Schmitz', 'Olivier Assayas', 'Ossama Mohammed',
       'Pablo Berger', 'Paolo Benvenuti', 'Paolo Sorrentino',
       'Peter Watkins', 'Park Chan-wook', 'Patty Jenkins', 'Paul Feig', 'Paul Greengrass',
       'Paul Thomas Anderson', 'Pedro Costa', 'Pete Docter',
       'Pete Docter', 'Peter Jackson', 'Peter Mullan', 'Peter Tscherkassky', 'Peter Weir', 'Phil Solomon',
       'Philip Kaufman', 'Philippe Garrel', 'Pietro Marcello',
       'Pirjo Honkasalo', 'Quentin Tarantino', 'Rachid Bouchareb', 'Rajat Kapoor', 'Ramin Bahrani', 'Raoul Peck',
       'Raoul Ruiz', 'Rian Johnson', 'Richard Curtis', 'Richard Kelly', 'Richard Linklater', 
       'Ridley Scott', 'Rob Marshall', 'Robert Altman', 'Robert Greene',
       'Robert Pulcini', 'Robert Zemeckis', 'Roberto Benigni', 'Roman Polanski',
       'Ronit Elkabetz', 'Roy Andersson', 'Ryan Coogler', 'S. Craig Zahler', 'Sam Mendes', 'Sam Raimi',
       'Sarah Polley', 'Satoshi Kon', 'Scandar Copti', 'Shane Black', 'Shane Carruth', 'Shane Meadows', 
       'Sidney Lumet', 'Sion Sono', 'Sofia Coppola', 'Spike Jonze',
       'Spike Lee', 'Stephen Chow', 'Stephen Daldry', 'Stephen Frears', 'Steve McQueen', 'Steven Knight',
       'Steven Soderbergh', 'Steven Spielberg', 'Susanne Bier', 'Takahisa Zeze', 'Takeshi Kitano', 'Terence Davies',
       'Terrence Malick', 'Terry George', 'Terry Zwigoff', 'Thom Andersen',
       'Thomas Vinterberg', 'Todd Haynes', 'Tom Ford',
       'Tom Hooper', 'Tom McCarthy', 'Tomas Alfredson', 'Tomm Moore',
       'Tommy Lee Jones', 'Tony Gilroy', 'Tony Scott', 'Travis Wilkerson',
       'Trey Parker', 'Ming-liang Tsai', 'Uli Edel', 'Ulrich Seidl',
       'Valeska Grisebach', 'Vikramaditya Motwane', 'Vincent Paronnaud', 
       'Wayne Blair', 'Werner Herzog', 'Wes Anderson',
       'Ernie Gehr', 'Whit Stillman', 'Wolfgang Becker',
       'Kar-wai Wong', 'Woody Allen', 'Xavier Beauvois', 'Xavier Dolan',
       'Yann Arthus-Bertrand', 'Yervant Gianikian', 'Yimou Zhang']

In [144]:
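urls = []        # one dict per director whose top search hit is a /name/ page
url_failed = []  # lower-cased names with no usable search hit
for director in nonspecial_directors:
    # Build the IMDb search URL: lower-case the name and turn spaces into '+'
    search_d = director.lower().replace(" ","+")
    url_string = 'http://www.imdb.com/find?&q=' + search_d + '&s=all'
    search_name = director.lower()
    raw_html = urlopen(url_string).read()
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    # The first element with class "result_text" holds the top search hit
    next_url = soup_doc.find(class_='result_text')
    if next_url is not None:
        good_url = next_url.a['href']
        # Split off the tracking suffix: '/name/nm0452102/?ref_=...' -> ['/name/nm0452102/', '_=...']
        cleaner_url = good_url.split("?ref")
        if cleaner_url[0].startswith('/name'):
            name_link = {}
            name_link['name'] = director
            name_link['url'] = cleaner_url[0]
            urls.append(name_link)
        else:
            url_failed.append(search_name)
        print(cleaner_url)
    else:
        # No search results at all: record this failure as well
        url_failed.append(search_name)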

['/name/nm0452102/', '_=fn_al_nm_1']
['/name/nm0444244/', '_=fn_al_nm_1']
['/name/nm0803066/', '_=fn_al_nm_1']
['/name/nm0193231/', '_=fn_al_nm_1']
['/name/nm0570912/', '_=fn_al_nm_1']
['/name/nm0002140/', '_=fn_al_nm_1']
['/name/nm0146760/', '_=fn_al_nm_1']
['/name/nm0327120/', '_=fn_al_nm_1']
['/name/nm0347492/', '_=fn_al_nm_1']
['/name/nm0720297/', '_=fn_al_nm_1']
['/name/nm2247200/', '_=fn_al_nm_1']
['/name/nm0812546/', '_=fn_al_nm_1']
['/name/nm1922735/', '_=fn_al_nm_1']
['/name/nm0314516/', '_=fn_al_nm_1']
['/name/nm0307497/', '_=fn_al_nm_1']
['/name/nm0668247/', '_=fn_al_nm_1']
['/name/nm1392994/', '_=fn_al_nm_1']
['/name/nm0036349/', '_=fn_al_nm_1']
['/name/nm0011470/', '_=fn_al_nm_1']
['/name/nm0231596/', '_=fn_al_nm_1']
['/name/nm0490487/', '_=fn_al_nm_1']
['/name/nm0004056/', '_=fn_al_nm_1']
['/name/nm1168657/', '_=fn_al_nm_1']
['/name/nm0906667/', '_=fn_al_nm_1']
['/name/nm0958558/', '_=fn_al_nm_1']
['/name/nm0000487/', '_=fn_al_nm_1']
['/name/nm1486975/', '_=fn_al_nm_1']
[

['/name/nm0661791/', '_=fn_al_nm_1']
['/name/nm0420941/', '_=fn_al_nm_1']
['/name/nm0082450/', '_=fn_al_nm_1']
['/name/nm0339030/', '_=fn_al_nm_1']
['/name/nm0000759/', '_=fn_al_nm_1']
['/name/nm0182276/', '_=fn_al_nm_1']
['/name/nm0230032/', '_=fn_al_nm_1']
['/name/nm0230032/', '_=fn_al_nm_1']
['/name/nm0001392/', '_=fn_al_nm_1']
['/name/nm0611932/', '_=fn_al_nm_1']
['/name/nm0874787/', '_=fn_al_nm_1']
['/name/nm0001837/', '_=fn_al_nm_1']
['/name/nm0813409/', '_=fn_al_nm_1']
['/name/nm0442241/', '_=fn_al_nm_1']
['/name/nm0308042/', '_=fn_al_nm_1']
['/name/nm2754122/', '_=fn_al_nm_1']
['/name/nm0393345/', '_=fn_al_nm_1']
['/name/nm0000233/', '_=fn_al_nm_1']
['/name/nm0098953/', '_=fn_al_nm_1']
['/name/nm0438494/', '_=fn_al_nm_1']
['/name/nm1023919/', '_=fn_al_nm_1']
['/name/nm0669704/', '_=fn_al_nm_1']
['/name/nm0749914/', '_=fn_al_nm_1']
['/name/nm0426059/', '_=fn_al_nm_1']
['/name/nm0193485/', '_=fn_al_nm_1']
['/name/nm0446819/', '_=fn_al_nm_1']
['/name/nm0000500/', '_=fn_al_nm_1']
[
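
The search loop builds its URL by replacing spaces with `+`. That works for plain names, but characters such as `&` would be read as the start of a new query parameter. A minimal sketch of a safer builder follows, using the standard library's `urlencode`. The helper name `build_search_url` and the constant `IMDB_FIND` are hypothetical, and the sketch assumes `find?q=...` is treated the same as the `find?&q=...` form used above.

```python
from urllib.parse import urlencode

IMDB_FIND = 'http://www.imdb.com/find'  # same endpoint as in the loop above


def build_search_url(name):
    """Return a search URL with the query safely percent-encoded."""
    # urlencode turns spaces into '+' and escapes characters such as '&'
    query = urlencode({'q': name.lower(), 's': 'all'})
    return IMDB_FIND + '?' + query


print(build_search_url('Abbas Kiarostami'))
print(build_search_url('Love & Basketball'))
```

The second call shows why this matters: the `&` in the title is encoded as `%26` instead of cutting the query short.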

In [146]:
for director in urls:
    final_url = 'http://www.imdb.com' + director['url']
    print(final_url)
    raw_html = urlopen(final_url).read()
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    # The birth block links to birth date, birth year and birth place, in that order
    director_info = soup_doc.find('div', id="name-born-info")
    # Look the links up once; a page without a birth block gives an empty list
    links = director_info.find_all('a') if director_info is not None else []
    if len(links) == 3:
        director['birth_date'] = links[0].text
        director['birth_year'] = links[1].text
        director['birth_place'] = links[2].text
    elif len(links) == 2:
        # Assumes the two links are year and place (no birth date listed)
        director['birth_date'] = ''
        director['birth_year'] = links[0].text
        director['birth_place'] = links[1].text
    else:
        # Unexpected layout: leave the fields empty instead of raising an error
        director['birth_date'] = ''
        director['birth_year'] = ''
        director['birth_place'] = ''

http://www.imdb.com/name/nm0452102/
http://www.imdb.com/name/nm0444244/
http://www.imdb.com/name/nm0803066/
http://www.imdb.com/name/nm0193231/
http://www.imdb.com/name/nm0570912/
http://www.imdb.com/name/nm0002140/
http://www.imdb.com/name/nm0146760/
http://www.imdb.com/name/nm0327120/
http://www.imdb.com/name/nm0347492/
http://www.imdb.com/name/nm0720297/
http://www.imdb.com/name/nm2247200/
http://www.imdb.com/name/nm0812546/
http://www.imdb.com/name/nm1922735/
http://www.imdb.com/name/nm0314516/
http://www.imdb.com/name/nm0307497/
http://www.imdb.com/name/nm0668247/
http://www.imdb.com/name/nm1392994/
http://www.imdb.com/name/nm0036349/
http://www.imdb.com/name/nm0011470/
http://www.imdb.com/name/nm0231596/
http://www.imdb.com/name/nm0490487/
http://www.imdb.com/name/nm0004056/
http://www.imdb.com/name/nm1168657/
http://www.imdb.com/name/nm0906667/
http://www.imdb.com/name/nm0958558/
http://www.imdb.com/name/nm0000487/
http://www.imdb.com/name/nm1486975/
http://www.imdb.com/name/nm0

http://www.imdb.com/name/nm0813409/
http://www.imdb.com/name/nm0442241/
http://www.imdb.com/name/nm0308042/
http://www.imdb.com/name/nm2754122/
http://www.imdb.com/name/nm0393345/
http://www.imdb.com/name/nm0000233/
http://www.imdb.com/name/nm0098953/
http://www.imdb.com/name/nm0438494/
http://www.imdb.com/name/nm1023919/
http://www.imdb.com/name/nm0669704/
http://www.imdb.com/name/nm0749914/
http://www.imdb.com/name/nm0426059/
http://www.imdb.com/name/nm0193485/
http://www.imdb.com/name/nm0446819/
http://www.imdb.com/name/nm0000500/
http://www.imdb.com/name/nm0000631/
http://www.imdb.com/name/nm0551128/
http://www.imdb.com/name/nm0000265/
http://www.imdb.com/name/nm1914992/
http://www.imdb.com/name/nm0700301/
http://www.imdb.com/name/nm0000709/
http://www.imdb.com/name/nm0000905/
http://www.imdb.com/name/nm0000591/
http://www.imdb.com/name/nm0253813/
http://www.imdb.com/name/nm0027815/
http://www.imdb.com/name/nm3363032/
http://www.imdb.com/name/nm0951975/
http://www.imdb.com/name/nm0

In [154]:
urls

[{'birth_date': 'June 22',
  'birth_place': 'Tehran, Iran',
  'birth_year': '1940',
  'name': 'Abbas Kiarostami',
  'url': '/name/nm0452102/'},
 {'birth_date': 'December 7',
  'birth_place': 'Tunis, Tunisia',
  'birth_year': '1960',
  'name': 'Abdellatif Kechiche',
  'url': '/name/nm0444244/'},
 {'birth_date': 'October 13',
  'birth_place': 'Kiffa, Mauritania',
  'birth_year': '1961',
  'name': 'Abderrahmane Sissako',
  'url': '/name/nm0803066/'},
 {'birth_date': 'May 26',
  'birth_place': 'Dartford, Kent, England, UK',
  'birth_year': '1955',
  'name': 'Adam Curtis',
  'url': '/name/nm0193231/'},
 {'birth_date': 'April 17',
  'birth_place': 'Philadelphia, Pennsylvania, USA',
  'birth_year': '1968',
  'name': 'Adam McKay',
  'url': '/name/nm0570912/'},
 {'birth_date': 'November 28',
  'birth_place': 'Warsaw, Mazowieckie, Poland',
  'birth_year': '1948',
  'name': 'Agnieszka Holland',
  'url': '/name/nm0002140/'},
 {'birth_date': 'September 14',
  'birth_place': 'Vendôme, Loir-et-Cher, 
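
Since the goal of the project is a searchable data set, it can help to index these records by director name. The following is a minimal sketch, not part of the original notebook: `index_by_name` is a hypothetical helper, and `records` is a small sample in the same shape as the entries in `urls`.

```python
# Sample records copied from the output above (only some of the keys)
records = [
    {'name': 'Abbas Kiarostami', 'birth_place': 'Tehran, Iran', 'birth_year': '1940'},
    {'name': 'Adam McKay', 'birth_place': 'Philadelphia, Pennsylvania, USA', 'birth_year': '1968'},
]


def index_by_name(rows):
    """Build a {name: record} dictionary, keeping the first record for each name."""
    lookup = {}
    for row in rows:
        lookup.setdefault(row['name'], row)
    return lookup


directors_by_name = index_by_name(records)
print(directors_by_name['Adam McKay']['birth_place'])
```

Keeping only the first record per name also protects against a director who appears twice in the input list.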

In [155]:
df = pd.DataFrame(urls)
df.to_csv('director_info.csv', index=False)
pd.read_csv('director_info.csv')

Unnamed: 0,birth_date,birth_place,birth_year,name,url
0,June 22,"Tehran, Iran",1940,Abbas Kiarostami,/name/nm0452102/
1,December 7,"Tunis, Tunisia",1960,Abdellatif Kechiche,/name/nm0444244/
2,October 13,"Kiffa, Mauritania",1961,Abderrahmane Sissako,/name/nm0803066/
3,May 26,"Dartford, Kent, England, UK",1955,Adam Curtis,/name/nm0193231/
4,April 17,"Philadelphia, Pennsylvania, USA",1968,Adam McKay,/name/nm0570912/
5,November 28,"Warsaw, Mazowieckie, Poland",1948,Agnieszka Holland,/name/nm0002140/
6,September 14,"Vendôme, Loir-et-Cher, France",1931,Alain Cavalier,/name/nm0146760/
7,,"Paris, France",1972,Alain Gomis,/name/nm0327120/
8,July 15,"Villefranche-de-Rouergue, Aveyron, France",1964,Alain Guiraudie,/name/nm0347492/
9,June 3,"Vannes, Morbihan, France",1922,Alain Resnais,/name/nm0720297/


In [156]:
df['url'] = 'http://www.imdb.com' + df['url']
df

Unnamed: 0,birth_date,birth_place,birth_year,name,url
0,June 22,"Tehran, Iran",1940,Abbas Kiarostami,http://www.imdb.com/name/nm0452102/
1,December 7,"Tunis, Tunisia",1960,Abdellatif Kechiche,http://www.imdb.com/name/nm0444244/
2,October 13,"Kiffa, Mauritania",1961,Abderrahmane Sissako,http://www.imdb.com/name/nm0803066/
3,May 26,"Dartford, Kent, England, UK",1955,Adam Curtis,http://www.imdb.com/name/nm0193231/
4,April 17,"Philadelphia, Pennsylvania, USA",1968,Adam McKay,http://www.imdb.com/name/nm0570912/
5,November 28,"Warsaw, Mazowieckie, Poland",1948,Agnieszka Holland,http://www.imdb.com/name/nm0002140/
6,September 14,"Vendôme, Loir-et-Cher, France",1931,Alain Cavalier,http://www.imdb.com/name/nm0146760/
7,,"Paris, France",1972,Alain Gomis,http://www.imdb.com/name/nm0327120/
8,July 15,"Villefranche-de-Rouergue, Aveyron, France",1964,Alain Guiraudie,http://www.imdb.com/name/nm0347492/
9,June 3,"Vannes, Morbihan, France",1922,Alain Resnais,http://www.imdb.com/name/nm0720297/


In [157]:
df.to_csv('director_info.csv', index=False)
pd.read_csv('director_info.csv')

Unnamed: 0,birth_date,birth_place,birth_year,name,url
0,June 22,"Tehran, Iran",1940,Abbas Kiarostami,http://www.imdb.com/name/nm0452102/
1,December 7,"Tunis, Tunisia",1960,Abdellatif Kechiche,http://www.imdb.com/name/nm0444244/
2,October 13,"Kiffa, Mauritania",1961,Abderrahmane Sissako,http://www.imdb.com/name/nm0803066/
3,May 26,"Dartford, Kent, England, UK",1955,Adam Curtis,http://www.imdb.com/name/nm0193231/
4,April 17,"Philadelphia, Pennsylvania, USA",1968,Adam McKay,http://www.imdb.com/name/nm0570912/
5,November 28,"Warsaw, Mazowieckie, Poland",1948,Agnieszka Holland,http://www.imdb.com/name/nm0002140/
6,September 14,"Vendôme, Loir-et-Cher, France",1931,Alain Cavalier,http://www.imdb.com/name/nm0146760/
7,,"Paris, France",1972,Alain Gomis,http://www.imdb.com/name/nm0327120/
8,July 15,"Villefranche-de-Rouergue, Aveyron, France",1964,Alain Guiraudie,http://www.imdb.com/name/nm0347492/
9,June 3,"Vannes, Morbihan, France",1922,Alain Resnais,http://www.imdb.com/name/nm0720297/
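
The `birth_place` column ends with the country, so one rough way to get a country for each director is to take the last comma-separated part. This is a sketch under that assumption. The helper name `country_from_birth_place` is hypothetical, and birth country is only a stand-in for country of origin.

```python
from collections import Counter


def country_from_birth_place(birth_place):
    """Return the text after the last comma, or '' when the value is missing."""
    # Missing values read back from a CSV are not strings, so guard against them
    if not isinstance(birth_place, str) or not birth_place.strip():
        return ''
    return birth_place.split(',')[-1].strip()


# Birth places copied from the table above, plus one empty value
places = ['Tehran, Iran', 'Dartford, Kent, England, UK',
          'Paris, France', 'Vannes, Morbihan, France', '']
countries = [country_from_birth_place(place) for place in places]
country_counts = Counter(country for country in countries if country)
print(country_counts.most_common())
```

With the data frame, the same function could be applied with `df['birth_place'].apply(country_from_birth_place)`.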


In [205]:
movies = ['12 Years a Slave', '2046', '24 Hour Party People', '25th Hour',
       '3-Iron', '35 Shots of Rum', '4 Months, 3 Weeks & 2 Days',
       '5 Broken Cameras', '5x2', '678', '7 Letters', '99 Homes',
       'A Borrowed Identity', 'A Commuters Life',
       'A Girl Walks Home Alone at Night', 'A History of Violence',
       'A Letter to Nelson Mandela',
       'A Pigeon Sat on a Branch Reflecting on Existence', 'A Prophet',
       'A Separation', 'A Serious Man', 'A Single Man',
       'A Tale of Two Sisters', 'A Time for Drunken Horses',
       'A Touch of Sin', 'A Vinganca de Uma Mulher',
       'AI: Artificial Intelligence', 'About Elly', 'Actress',
       'Adaptation', 'Adventureland', 'After the Wedding', 'Ajami',
       'Alexandre Sokurov', 'All Divided Selves', "Almayer's Folly",
       'Almost Famous', 'Alpha Dog', 'American Hustle', 'American Psycho',
       'American Splendor', 'Amores Perros', 'Amour', 'Amelie',
       'An Injury to One', 'Anchorman: The Legend of Ron Burgundy',
       'Animal Kingdom', 'Ankhon Dekhi', 'Anomalisa', 'Another Year',
       'Antichrist', 'Apocalypto',
       'Arabian Nights: Volume 1', 'Archipelago',
       'As I Was Moving Ahead Occasionally I Saw Brief Glimpses of Beauty',
       'At the First Breath of Wind', 'Attack the Block', 'Aurora',
       'Avatar', 'Babel', 'Bad Education',
       'Bad Lieutenant: Port of Call New Orleans', 'Bamboozled', 'Barbara',
       'Battle Royale', 'Battle in Heaven', 'Be Kind Rewind',
       'Beasts of the Southern Wild', 'Before Midnight', 'Before Sunset',
       "Before the Devil Knows Youre Dead", 'Beginners', 'Belle',
       'Bend It Like Beckham', 'Best in Show', 'Beyond the Hills',
       'Birdman', 'Birth', 'Black Swan', 'Blissfully Yours',
       'Blue Is the Warmest Color', 'Blue Valentine', 'Bone Tomahawk',
       'Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan',
       'Bowling for Columbine', 'Boyhood', 'Brick', 'Bridesmaids',
       'Bright Star', 'Brokeback Mountain', 'Brooklyn', 'Burning Bush',
       'Butter on the Latch', 'Cache', 'Cafe Lumiere',
       'Capitalism: Child Labor', 'Captain America: Winter Soldier',
       'Caramel', 'Carlos', 'Carol', 'Casino Royale', 'Cast Away',
       'Cave of Forgotten Dreams', 'Centochiodi', 'Certified Copy',
       'Chakde! India', 'Che', 'Chicago', 'Children of Men',
       'Chuck & Buck', 'City of God', 'Climates', 'Closed Curtain',
       'Collateral', 'Colossal Youth', 'Coma', 'Coming Home',
       'Conversations on a Sunday Afternoon', 'Cosmos', 'Court', 'Crank',
       'Crimson Gold', 'Crouching Tiger, Hidden Dragon',
       'Cuba: An African Odyssey', 'Dancer in the Dark', 'Daylight Moon',
       'Days of Glory', 'Death Proof', 'Detention', 'Dev D',
       'Dirty Pretty Things', 'Distant', 'District 9',
       'Divine Intervention', 'Django Unchained', 'Dogtooth', 'Dogville',
       'Donnie Darko', 'Dreams of a Life', 'Drive', 'Deja Vu',
       'Eat, Sleep, Die', 'Eden', 'Eldorado', 'Elena', 'Elephant',
       'Enter the Void', 'Eternal Sunshine of the Spotless Mind',
       'Even If She Had Been a Criminal', 'Everyone Else',
       'Evolution of a Filipino Family', 'Ex Machina',
       'Extraordinary Stories', 'Ezra', 'Fados', 
       'Far From Heaven', 'Fat Girl', 'Father of My Children',
       'Femme Fatale', 'Film socialisme', 'Finding Nemo', 'Fish Tank',
       'Footnote', 'Four Lions', 'Frances Ha', 'From What Is Before',
       'Frontier of Dawn', 'Frozen', 'Frozen River', 'Fruitvale Station',
       'Gangs of Wasseypur', 'Gerry', 'Gett: The Trial of Viviane Amselem',
       'Ghost World', 'Girl Walk: All Day', 'Girlhood', 'Gladiator',
       'Godzilla', 'Gone Girl', 'Good Bye Lenin!',
       'Good Night and Good Luck', 'Goodbye to Language',
       'Goodbye, Dragon Inn', 'Gosford Park', 'Gostanza da Libbiano',
       'Gran Torino', 'Gravity', 'Grizzly Man', 'Hard to Be a God',
       'Hazaaron Khwaishein Aisi', 'Head-On', 'Heart of a Dog',
       'Heaven Knows What', "Heavens Story", 'Hedwig and the Angry Inch',
       'Her', 'Hero', 'High Fidelity', 'Hijack Stories', 'Holy Motors',
       'Hooligan Sparrow', 'Horse Money', 'Hotel Rwanda',
       'House of Flying Daggers', 'How to Survive a Plague', 'Hugo',
       'Human', 'Hunger', 'I Am Love', "I Dont Want to Sleep Alone",
       'I Travel Because I Have To, I Come Back Because I Love You', 'Ida',
       'Idiocracy', 'Import Export', 'In Jackson Heights',
       'In Praise of Love', 'In Vandas Room', 'In the Family',
       'In the Mood for Love', 'Incendies', 'Inception',
       'Infernal Affairs', 'Inglourious Basterds', 'Inherent Vice',
       'Inland Empire', 'Inside Llewyn Davis', 'Inside Out',
       'Instructions for a Light and Sound Machine', 'Into Great Silence',
       'Irreversible', "Its Such a Beautiful Day", 'Im Going Home',
       'Im Not There', 'Jackass 3D', 'Japon', 'Jersey Boys',
       'Journey to the West', 'Katyn', 'Kill Bill: Vol. 1',
       'Kill Bill: Vol. 2', 'Kill List', 'Kings and Queen',
       'Kiss Kiss Bang Bang', 'Knight of Cups', 'Kung Fu Hustle',
       'LSD: Love, Sex Aur Dhokha', 'La Cienaga',
       'La commune (Paris, 1871)', 'Lagaan: Once Upon a Time in India',
       'Le filmeur', 'Le quattro volte', 'Let the Right One In',
       'Letters From Iwo Jima', 'Leviathan', 'Life of Pi', 'Lifeline',
       'Like Someone In Love', 'Lilya 4-Ever', 'Lincoln',
       'Listen to Me Marlon', 'Locke', 'Longing',
       'Los Angeles Plays Itself', 'Lost and Beautiful',
       'Lost in Translation', 'Lourdes', 'Love & Basketball',
       'Love & Friendship', 'Love Actually', 'Love Exposure', 'Lumumba',
       'Lust, Caution', 'Mad Max: Fury Road',
       "Madagascar 3: Europes Most Wanted", 'Making Of', 'Man on Wire',
       'Manakamana', 'Manchester by the Sea', 'Maqbool', 'Margaret',
       'Master and Commander: The Far Side of the World', 'Match Point',
       'Me and You and Everyone We Know', "Meeks Cutoff", 'Melancholia',
       'Memento', 'Memories of Murder', 'Men at Work', 'Mia Madre',
       'Miami Vice', 'Michael Clayton', 'Michelangelo Eye to Eye',
       'Middle of Nowhere', 'Millennium Actress', 'Millennium Mambo',
       'Million Dollar Baby', 'Miners Shot Down', 'Minority Report',
       'Mommy', 'Monsoon Wedding', 'Monster', 'Monsters, Inc.', 'Moolaade',
       'Moon', 'Moonrise Kingdom', 'Morvern Callar', 'Mosaik mecanique',
       'Mother', 'Moulin Rouge!', 'Mr Turner', 'Mulholland Drive',
       'Munich', 'My Golden Days', 'My Heart Beats Only for Her',
       'My Winnipeg', 'Mysteries of Lisbon', 'Mystic River', 'Nebraska',
       'Neighboring Sounds', 'Night Across the Street', 'Night Will Fall',
       'Nightcrawler', 'Nine Queens', 'No', 'No Country For Old Men',
       'No Home Movie', "No Mans Land", 'Nobody Knows',
       'Norte, the End of History', 'Nostalgia for the Light',
       'Notre musique', 'O Brother, Where Art Thou?', 'Oasis',
       'Of Gods and Men', 'Oh! Man', 'Oldboy', 'Once',
       'Once Upon a Time in Anatolia', 'Only Lovers Left Alive',
       'Open Range', 'Our Daily Bread', 'Paan Singh Tomar', 'Pain & Gain',
       "Pan's Labyrinth", 'Paradise Now', 'Persepolis', 'Phoenix',
       'Platform', 'Poetry', 'Police, Adjective', 'Primer',
       'Profit Motive and the Whispering Wind',
       "Psalm III: 'Night of the Meek'", 'Pulse', 'Punch-Drunk Love',
       'Quills', 'Ratatouille', 'Red Leaves', 'Red Road', 'Regular Lovers',
       'Requiem for a Dream', 'Restless City', 'Revolutionary Road',
       'Right Now, Wrong Then', 'Russian Ark', 'Rust and Bone',
       'Samson & Delilah', 'Sangue', 'Saraband', 'Secret Sunshine',
       'Secret Things', 'Selma', 'Sembene', 'Senna', 'Seventh Code',
       'Sexe, gombo et beurre sale', 'Sexy Beast', 'Shame', 'Shara',
       'Shaun of the Dead', 'Shoot the Messenger', 'Short Term 12',
       'Shrek', 'Shutter Island', 'Sicario', 'Sideways', 'Silent Light',
       'Silent Souls', 'Silver Linings Playbook', 'Sin Nombre',
       'Sita Sings the Blues', 'Snow White', 'Something Necessary',
       'Somewhere', 'Son of Saul', 'Song of the Sea',
       'Songs from the Second Floor', 'Sparrow', 'Spider-Man 2',
       'Spirited Away', 'Spotlight', 'Spring Breakers',
       'Spring, Summer, Fall, Winter and Spring', 'Star Trek',
       'Star Wars: Episode III', 'Star Wars: Episode VII', 'Step Brothers',
       'Still Life', 'Still Walking',
       'Storia di una donna amata e di un assassino gentile',
       'Stories We Tell', 'Story of My Death', 'Stranger by the Lake',
       'Stray Dogs', 'Sumas y Restas', 'Summer Hours', 'Sweet Sixteen',
       'Sympathy for Mr Vengeance', 'Syndromes and a Century',
       'Synecdoche, New York', 'Tabu', 'Take Shelter', 'Take This Waltz',
       'Talk to Her', 'Talladega Nights: The Ballad of Ricky Bobby',
       'Tangerine', 'Taxi Tehran', 'Team America: World Police', 'Ten',
       'Terra', 'Tey', 'The 3 Rooms of Melancholia', 'The Act of Killing',
       'The Arbor', 'The Artist', 'The Assassin',
       'The Assassination of Jesse James by the Coward Robert Ford',
       'The Avengers', 'The Baader Meinhof Complex', 'The Babadook',
       'The Bands Visit', 'The Beat That My Heart Skipped',
       'The Blind Swordsman: Zatoichi', 'The Bourne Ultimatum',
       'The Box of Life', 'The Captive', 'The Century of the Self',
       'The Child', 'The Circle', 'The Clock',
       'The Colonial Misunderstanding', 'The Comedy', 'The Congress',
       'The Consequences of Love', 'The Curious Case of Benjamin Button',
       'The Dark Knight', 'The Days When I Do Not Exist',
       'The Death of Mr Lazarescu', 'The Deep Blue Sea', 'The Departed',
       'The Descendants', 'The Diving Bell and the Butterfly',
       'The Edge of Heaven', 'The External World', 'The Five Obstructions',
       'The Forest for the Trees', 'The Fourth Watch', 'The Future',
       'The Gatekeepers', 'The Ghost Writer', 'The Gleaners and I',
       'The Grand Budapest Hotel', 'The Great Beauty', 'The Hateful Eight',
       'The Headless Woman', 'The Holy Girl', 'The Host', 'The Hours',
       'The House of Mirth', 'The Hunt', 'The Hurt Locker', 'The Imposter',
       'The Incredibles', 'The Intruder', 'The Kings Speech',
       'The Lady and the Duke', 'The Last of the Unjust',
       'The Lives of Others', 'The Lobster', 'The Look of Silence',
       'The Lord of the Rings: The Fellowship of the Ring',
       'The Lord of the Rings: The Return of the King',
       'The Magdalene Sisters', 'The Man Without A Past', 'The Master',
       'The Matchmaker', 'The Namesake', 'The New World', 'The Orphanage',
       'The Pianist', 'The Piano Teacher', 'The Prestige',
       'The Profession of Arms', 'The Queen', 'The Return', 'The Revenant',
       'The Romance of Astrea and Celadon', 'The Royal Tenenbaums',
       'The Sapphires', 'The Secret in Their Eyes', 'The Skin I Live In',
       'The Sky Trembles and the Earth Is Afraid and the Two Eyes Are Not Brothers',
       'The Social Network', 'The Son', "The Sons Room",
       'The Squid and the Whale', 'The Story of Marie and Julien',
       'The Strange Case of Angelica', 'The Stuart Hall Project',
       'The Sun Also Rises', 'The Taste of Others',
       'The Three Burials of Melquiades Estrada', 'The Tiger and the Snow',
       'The Time That Remains', 'The Tree of Life', 'The Turin Horse',
       'The White Meadows', 'The White Ribbon', 'The Wind Rises',
       'The Wolf of Wall Street', 'The World', 'The Wrestler', 'The Yards',
       'Theeb', 'There Will Be Blood', 'These Encounters of Theirs',
       'This Is England', 'This Is Not a Film', 'This Is the End',
       'Thithi', 'Three Monkeys', 'Three Times',
       'Tie Xi Qu: West of the Tracks', 'Timbuktu', 'Time Out',
       'Tinker Tailor Soldier Spy', 'To Die Like a Man', 'Toni Erdmann',
       'Tony Manero', 'Toy Story 3', 'Traffic', 'Triple Agent',
       'Tropic Thunder', 'Tropical Malady', 'Twixt', 'Two Lovers', 'Udaan',
       'Un lac', 'Uncle Boonmee Who Can Recall His Past Lives',
       'Under the Skin', 'United 93', 'Vera Drake', 'Vincere', 'WALL-E',
       'Waiting for Happiness', 'Waking Life', 'Waltz with Bashir',
       "Warming by the Devils Fire", 'We Need to Talk About Kevin',
       'Weekend', 'Wendy and Lucy', 'Werckmeister Harmonies',
       'Wet Hot American Summer', 'What Time Is It There?',
       'When I Saw You', 'When the Levees Broke: A Requiem in Four Acts',
       'Where Do We Go Now?', 'Whiplash', 'White Material', 'Wild',
       'Wild Tales', 'Winter Sleep', "Winters Bone",
       'Wolff Von Amerongen: Did He Commit Bankruptcy Offences?',
       'Woman on the Beach', 'World of Tomorrow', 'Y Tu Mama Tambien',
       'Yi Yi: A One and a Two', 'Yossi & Jagger',
       'You Aint Seen Nothin Yet', 'You Can Count On Me',
       'You Can Count on Me Me', 'You Will Meet a Tall Dark Stranger',
       'You, The Living', 'Young Adult', 'Zero Dark Thirty', 'Zodiac']

In [165]:
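from urllib.parse import quote_plus

m_urls = []    # one dict per movie whose top search hit is a /title/ page
m_failed = []  # lower-cased titles with no usable search hit
for movie in movies:
    # quote_plus turns spaces into '+' and escapes characters such as '&' in 'Love & Basketball'
    search_m = quote_plus(movie.lower())
    url_string = 'http://www.imdb.com/find?&q=' + search_m +'&s=all'
    search_name = movie.lower()
    raw_html = urlopen(url_string).read()
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    next_url = soup_doc.find(class_='result_text')
    if next_url is not None:
        good_url = next_url.a['href']
        cleaner_url = good_url.split("?ref")
        if cleaner_url[0].startswith('/title'):
            movie_link = {}
            movie_link['name'] = movie
            movie_link['url'] = cleaner_url[0]
            # Store the movie's own dict in the movie list (not in the director list)
            m_urls.append(movie_link)
        else:
            m_failed.append(search_name)
        print(cleaner_url)
    else:
        # No search results at all: record this failure as well
        m_failed.append(search_name)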

['/title/tt2024544/', '_=fn_al_tt_1']
['/title/tt0212712/', '_=fn_al_tt_1']
['/title/tt0274309/', '_=fn_al_tt_1']
['/title/tt0307901/', '_=fn_al_tt_1']
['/title/tt0423866/', '_=fn_al_tt_1']
['/title/tt1100048/', '_=fn_al_tt_1']
['/title/tt1032846/', '_=fn_al_tt_1']
['/title/tt2125423/', '_=fn_al_tt_1']
['/title/tt0354356/', '_=fn_al_tt_1']
['/title/tt1764141/', '_=fn_al_tt_1']
['/title/tt4691166/', '_=fn_al_tt_1']
['/title/tt2891174/', '_=fn_al_tt_1']
['/title/tt2841572/', '_=fn_al_tt_1']
['/title/tt6366476/', '_=fn_al_tt_1']
['/title/tt2326554/', '_=fn_al_tt_1']
['/title/tt0399146/', '_=fn_al_tt_1']
['/title/tt1883180/', '_=fn_al_tt_1']
['/title/tt1235166/', '_=fn_al_tt_1']
['/title/tt1832382/', '_=fn_al_tt_1']
['/title/tt1019452/', '_=fn_al_tt_1']
['/title/tt1315981/', '_=fn_al_tt_1']
['/title/tt0365376/', '_=fn_al_tt_1']
['/title/tt0259072/', '_=fn_al_tt_1']
['/title/tt2852400/', '_=fn_al_tt_1']
['/title/tt2050453/', '_=fn_al_tt_1']
['/title/tt0212720/', '_=fn_al_tt_1']
['/title/tt1

['/title/tt0118694/', '_=fn_al_tt_1']
['/title/tt1255953/', '_=fn_al_tt_1']
['/title/tt1375666/', '_=fn_al_tt_1']
['/title/tt0338564/', '_=fn_al_tt_1']
['/title/tt0361748/', '_=fn_al_tt_1']
['/title/tt1791528/', '_=fn_al_tt_1']
['/title/tt0460829/', '_=fn_al_tt_1']
['/title/tt2042568/', '_=fn_al_tt_1']
['/title/tt2096673/', '_=fn_al_tt_1']
['/title/tt0493428/', '_=fn_al_tt_1']
['/title/tt0478160/', '_=fn_al_tt_1']
['/title/tt0290673/', '_=fn_al_tt_1']
['/title/tt2396224/', '_=fn_al_tt_1']
['/title/tt0283422/', '_=fn_al_tt_1']
['/title/tt0368794/', '_=fn_al_tt_1']
['/title/tt1116184/', '_=fn_al_tt_1']
['/title/tt0322824/', '_=fn_al_tt_1']
['/title/tt1742044/', '_=fn_al_tt_1']
['/title/tt2017561/', '_=fn_al_tt_1']
['/title/tt0879843/', '_=fn_al_tt_1']
['/title/tt0266697/', '_=fn_al_tt_1']
['/title/tt0378194/', '_=fn_al_tt_1']
['/title/tt1788391/', '_=fn_al_tt_1']
['/title/tt0344273/', '_=fn_al_tt_1']
['/title/tt0373469/', '_=fn_al_tt_1']
['/title/tt2101383/', '_=fn_al_tt_1']
['/title/tt0
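
The printed lists above show what `split("?ref")` does to each search-result link: the first item is the `/title/.../` path you want, and the second is the leftover tracking parameter. Here is a minimal sketch of that cleaning step pulled out into small functions so you can test it without hitting IMDb. The function names `clean_title_path` and `full_title_url` are hypothetical (they are not in the code above), and the sample href is reconstructed from the first line of the output.

```python
IMDB_BASE = "http://www.imdb.com"  # same base URL the genre loop prepends


def clean_title_path(href):
    """Return the '/title/.../' part of a search-result href, or None."""
    path = href.split("?ref")[0]  # drop the '?ref_=...' tracking suffix
    if path.startswith("/title"):
        return path
    return None  # not a movie page (for example, a person or company link)


def full_title_url(href):
    """Build the absolute URL for a title href, or None if it is not a title."""
    path = clean_title_path(href)
    if path is None:
        return None
    return IMDB_BASE + path


# Sample href rebuilt from the first output line above (an assumption).
print(full_title_url("/title/tt2024544/?ref_=fn_al_tt_1"))
# prints http://www.imdb.com/title/tt2024544/
```

Returning `None` for anything that is not a `/title` link gives you one simple check to decide whether a movie goes in your good list or your failed list.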

In [170]:
for movie in m_urls:
    final_url = 'http://www.imdb.com' + movie['url']
    print(final_url)
    raw_html = urlopen(final_url).read()
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    movie_genres = soup_doc.find('div', class_="inline")
    # Skip pages with no genre block so find_all() is never called on None.
    if movie_genres is None:
        continue
    if len(movie_genres.find_all('a')) == 1:
        movie['genre_1'] = movie_genres.find_all('a')[0].string
    elif len(movie_genres.find_all('a')) == 2:
        movie['genre_1'] = movie_genres.find_all('a')[0].string
        movie['genres_2'] = movie_genres.find_all('a')[1].string
    elif len(movie_genres.find_all('a')) == 3:
        movie['genre_1'] = movie_genres.find_all('a')[0].string
        movie['genres_2'] = movie_genres.find_all('a')[1].string
        movie['genres_3'] = movie_genres.find_all('a')[2].string
    elif len(movie_genres.find_all('a')) == 4:
        movie['genre_1'] = movie_genres.find_all('a')[0].string
        movie['genres_2'] = movie_genres.find_all('a')[1].string
        movie['genres_3'] = movie_genres.find_all('a')[2].string
        movie['genres_4'] = movie_genres.find_all('a')[3].string
    elif len(movie_genres.find_all('a')) == 5:
        movie['genre_1'] = movie_genres.find_all('a')[0].string
        movie['genres_2'] = movie_genres.find_all('a')[1].string
        movie['genres_3'] = movie_genres.find_all('a')[2].string
        movie['genres_4'] = movie_genres.find_all('a')[3].string
        movie['genres_5'] = movie_genres.find_all('a')[4].string